ABSTRACT


This book aims to achieve the following goals: (1) to provide a high-level survey of key ana- 
lytics models and algorithms without going into mathematical details; (2) to analyze the usage 
patterns of these models; and (3) to discuss opportunities for accelerating analytics workloads 
using software, hardware, and system approaches. The book first describes 14 key analytics mod- 
els (exemplars) that span data mining, machine learning, and data management domains. For 
each analytics exemplar, we summarize its computational and runtime patterns and apply the in- 
formation to evaluate parallelization and acceleration alternatives for that exemplar. Using case 
studies from important application domains such as deep learning, text analytics, and business 
intelligence (BI), we demonstrate how various software and hardware acceleration strategies are 
implemented in practice. 

This book is intended for both experienced professionals and students who are interested 
in understanding core algorithms behind analytics workloads. It is designed to serve as a guide 
for addressing various open problems in accelerating analytics workloads, e.g., new architectural 
features for supporting analytics workloads, impact on programming models and runtime systems, 
and designing analytics systems. 
analytics, parallel algorithms, hardware acceleration







CHAPTER 1 


Introduction 


1.1 ANALYTICS: A DEFINITION 


From streaming news updates on smartphones, to instant messages on micro-blogging sites, to 
posts on social network sites, we are all being overwhelmed by massive amounts of diverse data 
(The Economist [2010]). Access to such a large amount of diverse data can be a boon if any
useful information can be extracted and applied rapidly and accurately to a problem at hand. For 
instance, we could contact all of our nearby friends for a dinner at a local, mutually agreeable and 
well-reviewed restaurant that has table availability for that night, but finding and organizing all 
that information can be very challenging. This process of identifying, extracting, processing, and 
integrating information from raw data, and then applying it to solve a problem is broadly referred 
to as analytics and has now become an integral part of everyday life. 


1.2 ANALYTICS AT YOUR SERVICE 


Tables 1.1 and 1.2 present a sample of key analytics applications from different domains, along 
with their functional characteristics. As these tables illustrate, many services that we take for 
granted and use extensively in everyday life would not be possible without analytics. For exam- 
ple, social networking applications such as Facebook, Twitter, and LinkedIn encode social rela- 
tionships as graphs and use graph algorithms to identify hidden patterns (e.g., finding common 
friends). Other popular applications like Google Maps, Yelp, or FourSquare combine location and 
social relationship information to answer complex spatial queries (e.g., find the nearest restaurant 
of a particular cuisine that your friends like). Usage of analytics has substantially improved the 
capabilities and performance of gaming systems as demonstrated by the recent win of IBM’s Wat- 
son intelligent question-answer system over human participants in the Jeopardy! challenge. The 
declining cost of computing and storage and the availability of such infrastructure in cloud envi- 
ronments has enabled organizations of any size to deploy advanced analytics and to package those 
analytic applications for broad usage by consumers. 

While consumer analytical solutions may help us all to better organize or enrich our per- 
sonal lives, the analytic process is also becoming a critical capability and competitive differentiator 
for modern businesses, governments and other organizations. In the current environment, orga- 
nizations need to make on-time, informed decisions to succeed. Given the globalized economy, 
many businesses have supply chains and customers that span multiple continents. In the public 
sector, citizens are demanding more access to services and information than ever before. Huge 
improvements in communication infrastructure have resulted in widespread use of online commerce
and a boom in smart, connected mobile devices. More and more organizations are run
around the clock, across multiple geographies and time zones, and those organizations are
being instrumented to an unprecedented degree. This has resulted in a deluge of data that can be
studied to harvest valuable information and make better decisions. In many cases, these large
volumes of data must be processed rapidly in order to make timely decisions. Consequently, many
organizations have employed analytics to help them decide what kind of data they should collect,
how this data should be analyzed to glean key information, and how this information should be
used for achieving their organizational goals. Examples of such techniques can be found in almost
any sector of the economy, including financial services (Crosbie and Bohn [2003]), government
(Goode [2011]), healthcare, retail (Richter et al. [2010]), manufacturing, logistics (Armacost
et al. [2004]), hospitality, and eCommerce (Davenport and Harris [2007]). Appendix A presents
a more exhaustive list of business analytics solutions and associated industrial sectors.


Table 1.1: Examples of well-known analytics applications

Application                                         Principal Goals
Google search, Bing                                 Web Indexing and Search
Netflix and Pandora                                 Video and Music Recommendation
Watson                                              Intelligent Question-Answer System
Telecom Churn Analysis                              Analysis of Call-data Records (CDRs)
Cognos Consumer Insight (CCI)                       Sentiment/Trend Analysis of Blogs
UPS                                                 Logistics, Transportation Routing
Amazon Web Analytics                                Online Retail Management
Moody's, Fitch, S&P Analytics                       Financial Credit Rating
Yelp, FourSquare                                    Integrated Geographical Analytics
Oracle, SAS Retail Analytics                        End-to-end Retail Management
Splunk                                              System Management Analytics
Salesforce.com                                      CRM Analytics
CoreMetrics, Mint, Youtube Analytics                Web-server Workload Analytics
Expedia, Orbitz                                     Travel Planning and Reservation
Flickr, Twitter, Facebook and Linkedin Analytics    Social Network Analysis
Healthcare Analytics                                Streaming Analytics for Intensive Care
Voice of Customer Analytics                         Analyzing Customer Voice Records


1.3 CLASSIFICATION OF ANALYTICS APPLICATIONS


As Table 1.2 illustrates, analytics applications exhibit a wide range of functional characteristics. 
The distinguishing feature of an analytics application is the use of mathematical formulations
for modeling and processing the raw data, and for applying the extracted information. These 
techniques include statistical approaches, numerical linear algebraic methods, graph algorithms, 
relational operators, and string algorithms. In practice, an analytics application uses multiple for- 
mulations, each with unique functional and runtime characteristics (Table 1.1). Further, depend- 
ing on the functional and runtime constraints, the same application can use different algorithms. 

Table 1.2: Key characteristics of the analytics applications

Application                             Key Functional Characteristics
Google, Bing search                     Web crawling, Link analysis of the web graph, Result ranking,
                                        Indexing Multi-media data
Netflix and Pandora                     Analyzing structured and unstructured data, Recommendation
Watson                                  Natural language processing, Processing large unstructured data,
                                        Artificial intelligence (AI) techniques for result ranking and wagering
Telecom Churn Analysis                  Graph modeling of call records, Large graph dataset,
                                        Connected component identification
Cognos Consumer Insight                 Processing large corpus of text documents, Extraction
                                        and Transformation, Text indexing, Entity extraction
UPS                                     Mathematical optimization-based solutions for transportation
Amazon Web Analytics                    Analysis of e-commerce transactions, Massive data sets,
Salesforce.com                          Real-time response, Reporting, Text search, Multi-tenant support
                                        personalization, Automated price determination, Recommendation
Moody's, Fitch, S&P                     Statistical analysis of large historical data
Google Maps, Yelp                       Spatial queries, Streaming and persistent data, Spatial ranking
Oracle, SAS, Amazon Retail Analytics    Analysis over large persistent and transactional data, Extraction and
                                        Transformation, Reporting, Integration with logistics, HR, CRM
Hyperic and Splunk                      Text analysis of large corpus of system logs
CoreMetrics, Mint, Youtube Analytics    Website traffic/workload characterization, Massive historical data,
                                        Video annotation and search, Online advertisement/marketing
Expedia, Orbitz                         Mathematical optimization-based solutions for travel industry
Flickr, Twitter, Facebook and Linkedin  Graph modeling of relations, Massive graph datasets,
                                        Graph analytics, Multi-media annotations and indexing
Healthcare Analytics                    Streaming data processing, Time-series analysis
Voice of Customer Analytics             Natural language processing, Text entity extraction





While many of the applications process a large volume of data, the type of data processed varies 
considerably. Internet search engines process unstructured text documents as input, while re- 
tail analytics operate on structured data stored in relational databases. Some applications such as 
Google Maps, Yelp, or Netflix use both structured and unstructured data. The velocity of data also 
differs substantially across analytics applications. Search engines process read-only historical data 
whereas retail analytics process both historical and transactional data. Other applications, such as 
the monitoring of medical instruments, work exclusively on real-time or streaming data. Depend- 
ing on the mathematical formulation, the volume and velocity of data and the expected I/O ac- 
cess patterns, the data structures and algorithms used by analytical applications vary considerably. 
These data structures include vectors, matrices, graphs, trees, relational tables, lists, hash-based
structures, and binary objects. They can be further tuned to support in-memory, out-of-core, or 
streaming execution of the associated algorithm. Thus, analytics applications are characterized 
by diverse requirements but share a common focus on the application of advanced mathematical 
modeling, typically on large data sets. 
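
The difference between in-memory and streaming execution of the same computation can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the function names and the sample readings are hypothetical and do not come from any application discussed here. The in-memory version needs the whole dataset at once, while the streaming version consumes one value at a time and keeps only a running count and mean.

```python
from typing import Iterable, Iterator, List, Tuple


def mean_in_memory(values: List[float]) -> float:
    """In-memory execution: the complete dataset is resident at once."""
    return sum(values) / len(values)


def mean_streaming(stream: Iterable[float]) -> Tuple[int, float]:
    """Streaming execution: a single pass that keeps constant state."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update: move the running mean toward the new value.
        mean += (value - mean) / count
    return count, mean


def sensor_readings() -> Iterator[float]:
    """Hypothetical stand-in for a real-time feed such as a medical monitor."""
    for reading in (72.0, 75.0, 71.0, 78.0):
        yield reading


if __name__ == "__main__":
    print(mean_in_memory([72.0, 75.0, 71.0, 78.0]))
    print(mean_streaming(sensor_readings()))
```

Both functions compute the same statistic; only the access pattern differs, which is the kind of tuning decision that the volume and velocity of the data drive.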
Given the diverse and demanding requirements of analytics and the new technology available
in systems, it is imperative to perform an in-depth study of various analytics applications.
Any insights would help us identify: (1) optimization opportunities for analytics applications on 
existing systems and (2) features for future systems that match the requirements of analytics ap- 
plications. Toward achieving these goals, as the first step, we examine the functional workflow of 
analytics applications from the usage to the implementation stages. To motivate the study of ana- 
lytics workloads, we first describe in detail a recent noteworthy analytics application: the Watson 
intelligent question/answer (Q/A) system. 


1.3.1 THE WATSON DEEPQA SYSTEM 


[Figure 1.1 stage labels: Question Analysis, Query Decomposition, Hypothesis Generation, Soft Filtering, Trained Models, Answer Sources, Synthesis, Final Merging and Ranking]

Figure 1.1: Functional workflow of the Watson question-answer system. 
Watson is a computer system developed to play the Jeopardy! game show against human
participants. Its goals are to correctly interpret the input natural language questions, accurately 
predict answers to the input questions and finally, intelligently choose the input topics and the 
wager amounts to maximize the gains. Watson is designed as an open-domain Q/A system using 
the DeepQA system, a probabilistic evidence-based software architecture whose core computa- 
tional principle is to assume and pursue multiple interpretations of the input question, to generate 
many plausible answers or hypotheses and to collect and evaluate many different competing evi- 
dence paths that might support or refute those hypotheses through a broad search of large volumes 
of content. This process is accomplished using multiple stages. The first stage, question analysis 
and decomposition, parses the input question and analyzes it to detect any semantic entities like 
names or dates. The analysis also identifies any relations in the question using pattern-based or 
statistical approaches. Next, using this information, a keyword-based primary search is performed 
over a varied set of sources, such as natural language documents, relational databases, and
knowledge bases, and a set of supporting passages (initial evidence) is identified. This is followed by
the candidate (hypothesis) generation phase, which uses rule-based heuristics to select a set of
candidates that are likely to be the answers to the input question. The next step, hypothesis and
evidence scoring, applies different algorithms to each evidence-hypothesis pair; these algorithms
dissect and analyze the evidence along different dimensions such as time, geography, popularity,
passage support, and source reliability. The end result of this stage is a ranked list of candidate
answers, each with a confidence score indicating the degree to which the answer is believed correct,
along with links back to the evidence. Finally, these evidence features are combined and weighted
by a logistic regression to produce the final confidence score that determines the successful
candidate (i.e., the correct answer). In addition to finding correct answers, Watson needs to master
the strategies for selecting clues to its advantage and betting the appropriate amount in any given
situation. The DeepQA system models different scenarios of the Jeopardy! game using different
simulation approaches (e.g., Monte Carlo techniques) and uses the acquired insights to maximize
Watson's winning chances by guiding topic selection, answering decisions, and wager selections.
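
The final scoring step, in which evidence features are weighted and combined by a logistic regression, can be sketched in a few lines of Python. This is a minimal illustration and not the DeepQA implementation: the weights, the bias, and the function names are hypothetical, and in a real system the weights would be learned from training data rather than written by hand.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical weights, one per evidence dimension named in the text.
WEIGHTS: Dict[str, float] = {
    "time": 0.8,
    "geography": 0.5,
    "popularity": 0.3,
    "passage_support": 1.5,
    "source_reliability": 1.0,
}
BIAS = -2.0  # hypothetical intercept


def confidence(features: Dict[str, float]) -> float:
    """Combine evidence features into a confidence score between 0 and 1."""
    z = BIAS + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))  # logistic (sigmoid) function


def rank_candidates(
    candidates: Dict[str, Dict[str, float]]
) -> List[Tuple[str, float]]:
    """Return (answer, confidence) pairs, most confident first."""
    scored = [(answer, confidence(f)) for answer, f in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    example = {
        "candidate A": {"passage_support": 1.0, "source_reliability": 1.0},
        "candidate B": {"popularity": 0.2},
    }
    for answer, score in rank_candidates(example):
        print(answer, round(score, 3))
```

The candidate at the head of the ranked list plays the role of the successful candidate; its confidence value is the quantity a wagering strategy could then take into account.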


1.3.2 FUNCTIONAL FLOW OF ANALYTICS APPLICATIONS 


The Watson system displays many traits that are common across analytics applications. They all 
have one or more functional goals. These goals are accomplished by one or more multi-stage 
processes, where each stage is an independent analytical component. As Figure 1.2 illustrates, 
execution of an analytics application can be partitioned into three main components: (1) solution, 
(2) library, and (3) implementation. 

The solution component is end-user focused and uses the library and implementation components
to satisfy the user's functional goal, which can be one of the following: prediction, prescription,
reporting, recommendation, quantitative analysis, pattern matching, or alerting (Davenport
and Harris [2007], Davenport et al. [2010]). For example, Watson’s key functional goals are: pat- 
tern matching for input question analysis, prediction for choosing answers, and simulation for 
wager and clue selection. Usually, any functional goal needs to be achieved under certain runtime 
constraints, e.g., calculations to be completed within a fixed time period, processing very large 
datasets or large volumes of data over streams, supporting batch or ad-hoc queries, or supporting 
a large number of concurrent users. For example, for a given clue, the Watson system is expected 
to find an answer before any of the human participants in the quiz. To achieve the functional and 
runtime goals of an application, the analytical solution leverages well-known analytical disciplines 
such as machine learning, data mining, optimization, data analysis, and simulation and model- 
ing. As we will observe, there is also substantial overlap between these disciplines. For example, 
statistics, data mining, and machine learning disciplines are closely related. Both data mining 
and machine learning use statistical techniques; data mining extracts information via discovering 
known patterns from existing data sources (e.g., finding patterns in customer sales data), whereas 
machine learning learns by building a model of the underlying system and uses it either to answer
an unknown input query (e.g., recognizing handwritten characters) or to find hidden relationships.
Each analytical discipline can support different problem types: e.g., statistics covers descriptive
and inferential statistics; data analysis includes both structured and unstructured analysis; and
machine learning and data mining approaches include unsupervised and supervised learning. In
supervised learning, the learning process uses a training dataset with a set of records, where each
record consists of values of key input factors that impact the model, called features, and a label
which represents the corresponding result.

Figure 1.2: Simplified functional workflow of analytics applications.

In practice, an analytics solution is built as a process with four distinct phases: ingesting
and pre-processing the input data, extracting and transforming the input data to select appropriate
data, building an analytical model using the selected data, and then using this model to compute
the final decision. Each phase can have one or more tasks which are implemented using appropriate
models from the analytics disciplines. These tasks are linked to build the end-to-end solution.
Table 1.3 lists the analytics disciplines that are used to achieve the functional goals of key
business focus areas. As illustrated in Table 1.3, in many cases, a functional goal can be achieved
by using more than one problem type. The choice of the problem type to be used depends on many factors that
include runtime constraints, underlying software and hardware infrastructure, etc. For example, 
customer churn analysis is a technique for predicting the customers that are most likely to leave 
the current service provider (retail, telecom, or financial) for a competitor. This technique can 
make use of one of the three analytics disciplines: statistics, machine learning, or data analysis. 
One approach models an individual customer's behavior using various parameters such as duration
of service, user transaction history, etc. These parameters are then fed either to a statistical model
such as regression or to a machine learning model such as a decision tree, to predict if a customer
is likely to defect (Mutanen [2006]).
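
The four phases of an analytics solution can be traced on the churn example with a small Python sketch. It is an illustration under stated assumptions rather than the cited method: the row format, the field names, and the sample data are hypothetical, and the model is deliberately reduced to a one-level decision tree (a "decision stump") on a single feature, the duration of service.

```python
from typing import Dict, List, Tuple

Record = Dict[str, float]


def ingest(raw_rows: List[str]) -> List[Record]:
    """Phase 1: parse raw 'months,transactions,churned' rows (hypothetical format)."""
    records = []
    for row in raw_rows:
        months, transactions, churned = (float(v) for v in row.split(","))
        records.append(
            {"months": months, "transactions": transactions, "churned": churned}
        )
    return records


def extract(records: List[Record]) -> List[Tuple[float, int]]:
    """Phase 2: select one feature (duration of service) and the label."""
    return [(r["months"], int(r["churned"])) for r in records]


def build_stump(samples: List[Tuple[float, int]]) -> float:
    """Phase 3: fit a one-level decision tree.

    Chooses the threshold t that minimizes the training errors of the rule
    'months < t implies churn'. Needs at least two distinct feature values.
    """
    values = sorted({months for months, _ in samples})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(t: float) -> int:
        return sum((months < t) != bool(label) for months, label in samples)

    return min(candidates, key=errors)


def decide(threshold: float, months: float) -> bool:
    """Phase 4: use the model to decide whether a customer is likely to defect."""
    return months < threshold


if __name__ == "__main__":
    raw = ["2,1,1", "4,3,1", "6,2,1", "18,20,0", "24,15,0", "36,30,0"]
    threshold = build_stump(extract(ingest(raw)))
    print(threshold, decide(threshold, 3.0), decide(threshold, 30.0))
```

Swapping the stump for a regression model would change only the third phase, which is one way to see why the same functional goal can be served by more than one problem type.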


Table 1.3: Examples of analytical focus areas, functional goals, and corresponding analytical disciplines

Focus Areas                         Functional Goals        Analytical Problem Type
Revenue Prediction                  Prediction              Supervised Learning
BioSimulation                       Prediction              Rule Engines and Simulation
Product Portfolio Optimizations     Prescription            Combinatorial Optimizations
Financial Performance Prediction    Prediction              Data Mining
Disease Spread Prediction           Prediction              Supervised Learning
Topic/Semantic Analysis             Pattern Matching        Text Analytics
Semiconductor Yield Analysis        Prediction              Data Mining
Cross-Sell Analysis                 Recommendation          Data Mining
Anomaly Detection                   Alerting                Data Mining and Data Analysis
Risk Analysis                       Quantitative Analysis   Supervised Learning
Retail Sales Analysis               Reporting               Supervised Learning
BioInformatics                      Quantitative Analysis   Data Analysis

















The second approach models the behavior of a customer based on her interactions with other
customers. This strategy is commonly used in the telecom sector, where customer calling patterns
are used to model subscriber relationships as a graph. This unstructured graph can then be
analyzed to identify subscriber groups and their influential leaders: usually the active and
well-connected subscribers. These leaders can then be targeted for marketing campaigns to reduce
defection among the members of their groups (Nanavati et al. [2006]).
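
The graph-based approach can be sketched as follows. This is a simplified illustration, not the method of the cited work: the call records are made up, subscriber groups are approximated by connected components, and the "leader" of a group is assumed to be the subscriber with the most distinct contacts.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def build_graph(calls: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Model call records as an undirected subscriber graph."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for caller, callee in calls:
        graph[caller].add(callee)
        graph[callee].add(caller)
    return dict(graph)


def connected_components(graph: Dict[str, Set[str]]) -> List[Set[str]]:
    """Identify subscriber groups with an iterative depth-first search."""
    seen: Set[str] = set()
    groups: List[Set[str]] = []
    for start in sorted(graph):
        if start in seen:
            continue
        group: Set[str] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(group)
    return groups


def leader(graph: Dict[str, Set[str]], group: Set[str]) -> str:
    """Pick the best-connected subscriber; ties go to the first name in sorted order."""
    return max(sorted(group), key=lambda subscriber: len(graph[subscriber]))


if __name__ == "__main__":
    calls = [("ann", "bob"), ("ann", "cat"), ("ann", "dan"),
             ("bob", "cat"), ("eve", "fay"), ("fay", "gus")]
    graph = build_graph(calls)
    for group in connected_components(graph):
        print(sorted(group), "leader:", leader(graph, group))
```

Degree is only one possible measure of influence; call frequency or call duration could be folded into the ranking key without changing the rest of the sketch.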

The library component is usually designed to be portable and broadly applicable (e.g., the
DeepQA runtime that powers the Watson system). A library usually provides multiple implementations
of specific models of the problem types shown in Table 1.3. For a problem type, a solution
is built using one or more processes or tasks. For example, an unsupervised learning problem can
be solved using one of two tasks: classification or clustering (Han and Kamber [2006]). Each
task can then be implemented by different analytical models, each of which can in turn use one
or more algorithms. For instance, the associative mining model can be implemented using
association rule mining algorithms or using decision trees (Chapter 2). Similarly, classification can be
implemented using nearest-neighbor, neural network, or naive Bayes algorithms (Chapter 2). It
should be noted that, in practice, the separation between models and algorithms is not strict and,
in many cases, the same algorithm can be used to support more than one model. For instance,
neural networks can be used for clustering or classification.

Finally, depending on how the problem is formulated, each algorithm uses different data 
structures and kernels. For example, many algorithms formulate the problem using dense or sparse 
matrices and invoke kernels such as matrix-matrix and matrix-vector multiplication, matrix factorization,
and linear system solvers. These kernels are sometimes intensively optimized for the underlying
system architecture, in libraries such as IBM ESSL or Intel MKL. Any kernel implementation
can be characterized according to: (1) how it is implemented on a system consisting of one or more
processors, and (2) how it is optimized for the underlying processor architecture. The system-level
implementation varies depending on whether a kernel is sequential or parallel, and if it uses in- 
memory or out-of-core data. Many parallel kernels can use shared or distributed memory paral- 
lelism. In particular, if the algorithm is embarrassingly parallel, requires large data, and the kernel 
is executing on a distributed system, it can often use the map-reduce approach (Dean and Ghe- 
mawat [2010]). At the lowest level, the kernel implementation can often exploit hardware-specific 
features such as short-vector data parallelism (SIMD) or task parallelism on multi-core CPUs, 
massive data parallelism on GPUs, and application-specific parallelism on FPGAs or ASICs. 
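
The map-reduce approach mentioned above can be illustrated with a word-count sketch in Python. This is a single-process illustration under stated assumptions, not a distributed implementation: the function names are hypothetical, and the map calls run sequentially here, although each one touches only its own document and could therefore run on a different node.

```python
from collections import Counter
from functools import reduce
from typing import Iterable


def map_count(document: str) -> Counter:
    """Map step: count the words of one document independently of all others."""
    return Counter(document.lower().split())


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial results by adding their counts."""
    return left + right


def word_count(documents: Iterable[str]) -> Counter:
    """Apply the map step to every document, then fold the partial counts."""
    partial_counts = map(map_count, documents)
    return reduce(reduce_counts, partial_counts, Counter())


if __name__ == "__main__":
    docs = ["big data", "Big analytics", "data data"]
    print(word_count(docs).most_common())
```

Because the merge in the reduce step is associative, partial results can be combined in any grouping, which is what lets a distributed runtime merge them as they arrive.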

Although analytics applications have come of age, they have not yet received significant 
attention from the computer architecture community. It is important to understand systems 
implications of the analytics applications, not only because of their diverse and demanding re- 
quirements, but also, because systems architecture is currently undergoing a series of disruptive 
changes. Widespread use of technologies such as multi-core processors, specialized co-processors 
or accelerators, flash memory-based solid state drives (SSDs), and high-speed networks has cre- 
ated new optimization opportunities. More advanced technologies such as phase-change memory 
are on the horizon and could be game-changers in the way data is stored and analyzed. In spite of 
these trends, currently there is limited usage of such technologies in the analytics domain. Given 
the wide variety of algorithmic and system alternatives for executing analytics applications, it 
is often difficult for solution developers to make the right choices to address specific problems. 
Naive usage of modern technologies often leads to unbalanced solutions that further increase
optimization complexity. Thus, to ensure effective utilization of system resources (CPU, memory,
networking, and storage), it is necessary to evaluate analytics workloads in a holistic manner.

We aim to understand the application of modern systems technologies to optimizing ana- 
lytics workloads by exploring the interplay between overall system design, core algorithms, soft- 
ware (e.g., compilers, operating system), and hardware (e.g., networking, storage, and processors). 
Specifically, we are interested in isolating repeated patterns in analytical applications, algorithms, 
data structures, and data types, and using them to make informed decisions on systems design. 
Toward this goal, we have been examining the functional flow of a variety of analytical work- 
loads across multiple domains (Table 1.1). As a result of this exercise, we have identified a set 
of commonly used analytical models, called analytics exemplars (Bordawekar et al. [2011]). We 
believe that these exemplars represent the essence of analytical workloads and help us identify common
computational and runtime patterns across different analytics workloads. Thus, the analytics
exemplars can be used as a toolkit for performing exploratory systems design for the analytics 
domain. Table 2.1 lists the 14 analytics exemplars and the corresponding functional domains. In 
this book, we rely on these exemplars to illustrate that analytics applications benefit greatly from 
holistically co-designed software and hardware solutions and demonstrate this approach using 
examples from different domains. 


1.4 INTENDED AUDIENCE


This book is designed for those with a background in computer architecture and compiler design.
The goal of this book is to provide a high-level survey of key analytics models and algorithms,
without going into mathematical details. The rest of this book is organized as follows: we first 
overview the 14 analytics exemplars and the key algorithms used by each exemplar; we then sum- 
marize computational and runtime patterns of these exemplars. Based on this information, we 
discuss various acceleration opportunities and demonstrate them using case studies of key ex- 
emplars. We conclude by discussing various open issues in accelerating analytics workloads, e.g., 
new architectural features for supporting analytics workloads, impact on programming models 
and runtime systems, and designing analytics systems. We hope this study acts as a call to action 
for computer architects and systems designers to focus future research on analytics. 


Further Reading 

Using analytics in business problems, see Anderson et al. [2009], Apte et al. [2002], Busi- 
ness Week [2006], Cantor et al. [1997], Crosbie and Bohn [2003], Davenport and Harris 
[2007], Davenport et al. [2010], Eckerson [2009], Goode [2011], Harding et al. [2006], 
Madsen [2009], Manyika et al. [2011], NIPS 2009 Workshop [2009], Nisbet et al. [2009], 
Piatetsky-Shapiro [2011], Rexer [2013], Science Special Issue [2011], Shmueli et al. [2010], 
Suman [2006], The Economist [2007]; Analytics solutions: Netflix Bell et al. [2010], Pan- 
dora Glaser et al. [2002], Joyce [2006], Watson Ferrucci et al. [2010], Telecom Churn Pre- 
diction Dasgupta et al. [2008], Ngai et al. [2009], Richter et al. [2010], Cognos Consumer 
Insight (CCI) IBM Institute for Business Value [2011], Sindhwani et al. [2011], UPS Ar- 
macost et al. [2004], Lohatepanont and Barnhart [2004], Hyperic and Splunk Splunk Inc. 
[2011]; Analytics products: R, Weka Hall et al. [2009], STATISTICA, RapidMiner, Oracle, 
SAP, and IBM Bhattacharya et al. [2009], IBM Corp. [2011]; Academic references Bekker- 
man et al. [2011], Han and Kamber [2006], Leskovec et al. [2014], StatSoft Inc. [2010], 
Wu et al. [2008]. 





CHAPTER 2 


Overview of Analytics 
Exemplars 


2.1 EXEMPLAR MODELS


Table 2.1 presents the 14 analytics exemplars with their application domains and Table 2.2 lists 
the associated analytics problem types. As these tables illustrate, each analytics exemplar can 
be used in multiple domains and can address multiple functional goals. Further, an application 
domain can use one or more exemplars (e.g., marketing). These analytic models span the key an- 
alytics disciplines such as data mining, machine learning, statistics, simulation, and data analysis. 
Finally, an exemplar can be used in one or more analytics phases, e.g., the data ingestion and
pre-processing phase can use text analytics or time-series processing, the transform and load phase can
use on-line analytical processing or graph analytics, the model building phase can use regression,
clustering, or decision trees, and finally the decision processing phase can use mathematical programming,
Monte Carlo methods, decision trees, or graph analytics. In the remainder of this chapter we
discuss these models in detail; for every model, we first discuss the target application domains, 
then outline the basic idea, and then summarize the key implementation algorithms. 


Table 2.1: Analytics exemplar models and their application domains

Analytics Exemplar               Key Application Domains
Regression Analysis              Social Sciences, Marketing, Economics
Clustering                       Marketing, Medical Imaging, Document Management
Nearest-Neighbor Search          Computational Biology, Image Processing
Associative Rule Mining          Retail Analysis, Bio-Informatics, Intrusion Detection
Recommendation Systems           Online Media and e-commerce
Neural Networks                  Image and Speech Recognition, Fraud Detection
Support Vector Machines          Bio-Informatics, Document Classification, Financial Modeling
Decision Tree Learning           Medical Diagnostics, Fraud Detection, Marketing, Manufacturing
Time Series Processing           Medical Informatics, Geology, Economics
Text Analytics                   Web Search, Medical Informatics, Bio-Informatics
Monte Carlo Methods              Computational Finance, Insurance Risk Modeling
Mathematical Programming         Routing, Scheduling, Manufacturing
On-line Analytical Processing    Sales/Marketing, Retail
Graph Analytics                  Social Analysis, Computational Neuroscience, Logistics








Table 2.2: Analytics exemplars and corresponding problem types

Analytics Exemplar           | Analytics Problem Type
Regression                   | Inferential Statistics, Supervised Learning
Clustering                   | Data Mining, Unsupervised Learning
Nearest Neighbor Search      | Data Mining, Unsupervised Learning
Association Rule Mining      | Unsupervised Learning
Recommender Systems          | Unsupervised Learning
Neural Networks              | Machine Learning, Supervised/Unsupervised Learning
Support Vector Machines      | Machine Learning, Supervised Learning
Decision Trees               | Machine Learning, Supervised Learning
Text Analytics               | Data Analysis, Supervised/Unsupervised Learning
Time Series Processing       | Data Analysis, Unsupervised Learning
Mathematical Programming     | Mathematical Optimization
Monte Carlo Methods          | Simulation
Online Analytical Processing | Data Analysis
Graph Analytics              | Data Analysis











2.2 REGRESSION ANALYSIS 


Regression analysis is a classical statistical technique used to model the relationship between a
dependent variable and one or more independent variables. Specifically, regression can predict
how the value of the dependent variable changes when any one of the independent variables is
varied while the remaining independent variables are held fixed. Regression analysis is primarily
used for prediction and forecasting, and also to discover relationships between dependent and
independent variables, e.g., to estimate the conditional expectation of the dependent variable
given multiple independent variables. Thus, regression can be viewed as an example of supervised
learning, in which observed values of the independent and dependent variables are used for
training. Regression analysis has been used in many application domains, including economics,
psychology, social sciences, marketing, healthcare, and computational finance, e.g., for
predicting house prices based on information such as crime rate, population, number of rooms, and
property tax, or for predicting airfares on new routes using location and airport information such
as population, average income, passenger estimates, and number of airport gates (Shmueli et al.
[2010]).


Basic Idea: Formally, any regression model relates a dependent variable $Y$ to a regression
function $f$ of independent variables (regressors) $X$ and unknown parameters $\beta$:
$Y = f(X, \beta)$. The quality of the prediction depends on the amount of information available
about the independent variable $X$. If $k$ is the length of the vector of unknown parameters
$\beta$, then the regression analysis is possible only if $N \ge k$, where $N$ is the number of
observed data points of the form $(Y, X)$. If $N = k$ and the regression function $f$ is linear,
then the equations $Y = f(X, \beta)$ can be solved exactly. However, if the regression function is
nonlinear, then either many solutions exist or a solution may not exist. When $N > k$ (also called
an over-determined system), a best-fit strategy is usually used to estimate the parameters
$\beta$. We now discuss some of the key regression algorithms.


Linear Regression: Models the relationship between a scalar dependent variable $Y$ and one or
more regressor variables $X$ using linear regression functions. A dependent variable $y_i$ can be
approximated as a linear combination of regressors $x_{ij}$. The following represents a simple
linear regression model for $n$ data points with independent variables $x_{ij}$ and regression
coefficients $\beta_j$:

$$y_i = \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i = x_i^{T}\beta + \varepsilon_i, \qquad i = 1, \ldots, n, \qquad (2.1)$$

where $T$ denotes the transpose, so that $x_i^{T}\beta$ is the inner product between vectors $x_i$
and $\beta$, and $\varepsilon_i$ is the error term. Any solution for the linear regression model
aims to estimate and infer the values of the regression coefficients $\beta_j$ using estimation
methods such as the Ordinary or Generalized Least Squares methods.
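
As a minimal sketch of ordinary least squares, the following example fits the single-regressor
case of the linear model. It assumes an intercept term (written `b0` here) in addition to the
slope `b1`; the function name `fit_simple_ols` and the data are illustrative and do not come from
any particular package.

```python
def fit_simple_ols(xs, ys):
    """Ordinary least squares for y_i = b0 + b1 * x_i + e_i (one regressor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: sample covariance of x and y divided by the sample variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = mean_y - b1 * mean_x
    return b0, b1

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
b0, b1 = fit_simple_ols(xs, ys)   # approximately b0 = 1.15, b1 = 1.94
```

With more than one regressor the same idea leads to a linear system in the coefficients, which is
normally solved with a numerical linear algebra library rather than by hand.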


Nonlinear Regression: The nonlinear regression model is characterized by the fact that the
dependent variables $Y$ are related to the regressor variables $X$ via a function that is
nonlinear in one or more unknown parameters $\beta$. A nonlinear regression model has the
following form:

$$y_i = f(x_i, \beta) + \varepsilon_i, \qquad i = 1, \ldots, n, \qquad (2.2)$$

where the function $f(x_i, \beta)$ is nonlinear in that it cannot be expressed as a linear
combination of the parameters $\beta$, and the $\varepsilon_i$ are random errors. Common nonlinear
functions include exponential decay/growth, logarithmic, trigonometric, power, and Gaussian
functions.

For nonlinear regression, the parameters $\beta$ can be estimated by minimizing a suitable
goodness-of-fit expression with respect to $\beta$. One popular approach is to minimize the sum
of squared residuals with the nonlinear least squares method, which uses the Gauss-Newton
numerical method. In some cases, maximum likelihood or weighted least squares estimation is used.
Alternatively, in some cases the nonlinear function can be transformed into a linear model, which
can then be estimated using linear regression approaches.
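
The transformation route can be illustrated with an exponential growth model. The sketch below
assumes the model $y = a e^{bx}$ with all $y > 0$, so that taking logarithms gives the linear
model $\ln y = \ln a + b x$; the function name `fit_exponential` and the synthetic data are
illustrative assumptions.

```python
import math

def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by least squares on the log-transformed model."""
    logs = [math.log(y) for y in ys]          # requires every y > 0
    n = len(xs)
    mean_x = sum(xs) / n
    mean_l = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxl = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, logs))
    b = sxl / sxx                             # slope of the linearized model
    a = math.exp(mean_l - b * mean_x)         # intercept ln(a), mapped back
    return a, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]    # noise-free synthetic data
a, b = fit_exponential(xs, ys)
```

Note that least squares on the transformed data minimizes errors in $\ln y$ rather than in $y$,
so on noisy data the result generally differs from a direct nonlinear least squares fit.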


Logistic Regression: Logistic regression is used to predict the probability of occurrence of an
event by fitting data to a logistic function, $f(z) = \frac{1}{1 + e^{-z}}$. The variable $z$ is a
measure of the total contribution of all the independent variables, while $f(z)$ represents the
probability of a particular outcome given the set of independent variables. The variable $z$ is
usually defined as

$$z = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k, \qquad (2.3)$$

and for any value of $z$, the output $f(z)$ varies between 0 and 1.
The parameters $\beta_i$ can then be estimated by the maximum likelihood approach via the
iteratively re-weighted least squares method, which can use the Gauss-Newton numerical algorithm.
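
The following sketch shows the logistic function and a maximum likelihood fit for a small data
set. For brevity it swaps in plain gradient ascent on the log-likelihood in place of iteratively
re-weighted least squares; the function names, step size, and data are illustrative assumptions.

```python
import math

def logistic(z):
    """Logistic function f(z) = 1 / (1 + exp(-z)); the output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(rows, labels, steps=5000, rate=0.1):
    """Maximize the log-likelihood by gradient ascent (not IRLS)."""
    beta = [0.0] * (len(rows[0]) + 1)          # beta[0] is the intercept
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for x, y in zip(rows, labels):
            z = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
            err = y - logistic(z)              # per-item log-likelihood gradient
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        beta = [b + rate * g / len(rows) for b, g in zip(beta, grad)]
    return beta

rows = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0]]
labels = [0, 0, 1, 0, 1, 1]                    # not linearly separable
beta = fit_logistic(rows, labels)
p_low = logistic(beta[0] + beta[1] * 0.5)      # predicted probability at x = 0.5
p_high = logistic(beta[0] + beta[1] * 3.0)     # predicted probability at x = 3.0
```

The data are deliberately not separable, so the maximum likelihood estimate is finite and the
iteration settles instead of driving the coefficients toward infinity.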

Logistic regression is one of the most widely used analytics algorithms, both as a standalone
method for binary or multinomial classification and as a kernel in other classifiers such as
neural networks (Bengio et al. [2014], Вехег [2013]).


Further Reading

Algorithms: Han and Kamber [2006], Wu et al. [2008]; Packages: R (The R Foundation),
Weka (Hall et al. [2009]), IBM SPSS (IBM SPSS [2010b]), RapidMiner (Rapid-i), STATISTICA
(The StatSoft Inc.), Oracle Data Miner (Oracle Corp. [b]), and SAS (SAS Institute
Inc.); Applications: Shmueli et al. [2010], Smyth [2002], StatSoft Inc. [2010].





2.3 CLUSTERING 


Clustering is the process of grouping entities from an ensemble into classes of entities that are
similar in some sense. Clustering is also called data segmentation (Han and Kamber [2006],
Shmueli et al. [2010]) as it partitions large datasets into segments whose members are similar to
one another and dissimilar to members of other segments. Clustering is an example of unsupervised
machine learning and is used in a wide variety of applications, e.g., in market segmentation for
partitioning customers according to gender, interests, etc., in gene sequence analysis for
identifying gene families, in medical imaging for differentiating key features in PET scan images,
and in document management for clustering documents based on semantic information.


Basic Idea: Any clustering algorithm needs to effectively identify and exploit relevant
similarities in the underlying, potentially disparate, data sources. The similarity can be
expressed using either a geometric distance-based metric (e.g., the Euclidean or Minkowski metric;
Kruskal [1964]) or conceptual relationships in the data. The input data can be noisy, of different
types (interval-based, binary, categorical, ordinal, vector, mixed, etc.), high-dimensional, and
large (e.g., millions of objects). Thus, the key challenges for any clustering algorithm are the
effective use of the similarity metric, exploitation of intrinsic characteristics of the data, and
support for a large number of dimensions, large data sets, and different cluster shapes.

Current approaches to solving the clustering problem can be broadly classified into parametric
and nonparametric approaches. The parametric or model-based methods assume that the input data is
associated with a certain probability distribution, and the clustering is designed to fit the data
to some mathematical model (Han and Kamber [2006]). The nonparametric methods exploit spatial
properties (e.g., distance or density) of the input data. Clustering algorithms also differ
depending on the dimensionality of the datasets (Kriegel et al. [2009]). We now discuss some of
the key clustering algorithms; K-Means and Hierarchical clustering algorithms are examples of
nonparametric clustering, and EM clustering is an example of parametric clustering.


K-Means Clustering: The k-means clustering algorithm (Hartigan and Wong [1979], MacQueen
[1967]) is an example of a partitioning method that constructs $k$ partitions from a database of
$n$ objects, where each partition represents a cluster and $k \le n$. Each cluster contains at
least one object, and each object lies in exactly one cluster. A partitioning algorithm creates an
initial assignment of objects to partitions and then uses an iterative relocation technique (Han
and Kamber [2006]) to move objects among the groups. The objects in a cluster are considered to be
closer to each other than to objects in different clusters. In the k-means algorithm, cluster
similarity is measured with respect to the mean value of the objects in a cluster, which can be
viewed as the cluster’s centroid or center of gravity.

The k-means algorithm works only on data whose mean can be computed (e.g., it is not possible to
compute a mean value for categorical datasets). The algorithm also requires the number of clusters
$k$ to be defined a priori. The algorithm cannot discover clusters with arbitrary (e.g.,
non-convex) shapes and is sensitive to noise and outliers, as they can significantly affect the
calculation of the mean value. One variation of k-means, the k-modes method, uses modes as a
measure of similarity for categorical objects together with a frequency-based update method
(Chaturvedi et al. [2001]).
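
The iterative relocation described above can be sketched as follows. This is a minimal version of
the standard batch (Lloyd-style) k-means iteration; taking the first $k$ objects as the initial
centroids is an assumption made only to keep the example deterministic, and the function name
`kmeans` is illustrative.

```python
def kmeans(points, k, iterations=100):
    """Alternate an assignment step and a centroid (mean) update step."""
    centroids = [list(p) for p in points[:k]]    # naive initialization (assumption)
    assignment = None
    for _ in range(iterations):
        # Assignment step: each object joins the cluster with the nearest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:         # no object moved: converged
            break
        assignment = new_assignment
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment

points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.2), (0.8, 1.2), (9.2, 8.8)]
centroids, assignment = kmeans(points, k=2)
```

On this data the objects near (1, 1) and those near (9, 9) end up in separate clusters. In
practice the result depends on the initial centroids, so the procedure is usually run from
several starting points.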


Hierarchical Clustering: Hierarchical clustering methods group data objects into a tree of
clusters. Hierarchical clustering often produces data clusters that can be viewed graphically
using a dendrogram. Hierarchical methods can be classified into: (1) agglomerative, i.e., those
that use a bottom-up strategy to construct increasingly large clusters until certain termination
conditions are met; and (2) divisive, i.e., those that start from a single cluster and subdivide
it into smaller pieces until termination conditions are met.
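
A minimal sketch of the agglomerative (bottom-up) strategy is shown below for one-dimensional
data. It assumes single linkage (the distance between two clusters is the distance between their
closest members) and uses a target number of clusters as the termination condition; both choices,
and the function name, are illustrative.

```python
def agglomerative(points, num_clusters):
    """Bottom-up clustering of 1-D points with single linkage (an assumption)."""
    clusters = [[p] for p in points]             # start: one cluster per object
    merges = []                                  # merge history (dendrogram levels)
    while len(clusters) > num_clusters:          # termination condition
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return [sorted(c) for c in clusters], merges

clusters, merges = agglomerative([1.0, 1.5, 5.0, 5.2, 9.0], num_clusters=2)
```

The recorded merge history is what a dendrogram displays: each entry names the two clusters that
were joined and the distance at which the merge happened.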

One of the most popular hierarchical clustering algorithms, BIRCH (Zhang et al. [1996]), uses the
agglomerative strategy. BIRCH uses a two-step strategy to improve I/O scalability and clustering
flexibility. In the first, micro-clustering stage, it uses the hierarchical clustering strategy to
build an initial set of in-memory clusters that summarize the information in the original data.
The second, macro-clustering stage processes these summarized in-memory clusters (i.e., it does
not fetch the raw data again) using any clustering method, e.g., iterative partitioning, and
computes the final clustering.


EM Clustering: Expectation-Maximization (EM) clustering is a model-based (parametric) approach
that extends the k-means partitioning algorithm. The EM clustering algorithm assumes that the
underlying data is a mixture of $k$ probability distributions (referred to as component
distributions), where each distribution represents a cluster. The key problem for any model-based
algorithm is to estimate the parameters of the probability distributions so as to best fit the
data.

The EM algorithm (Dempster et al. [1977], Han and Kamber [2006]) is an iterative refinement
algorithm that extends the k-means paradigm (like k-means, it uses a predetermined number of
clusters): it assigns a data item to a cluster according to a weight representing the probability
of membership (unlike the cluster mean metric used in the k-means algorithm). The new means of
these clusters are then computed using the weighted measures. The most common version of the EM
algorithm learns a mixture of Gaussian distributions (Bilmes [1998]). It starts with an initial
estimate of the parameters of the mixture model, referred to as the parameter vector. Each data
item is assigned a probability that it would possess a certain set of attributes given that it was
a member of a given cluster. The items are then re-scored against the mixture density produced by
the parameter vector, and the re-scored items are used to update the parameter estimates. The
complexity of the EM algorithm is linear in $d$ (the number of dimensions or features of the input
data), $n$ (the number of data items), and $t$ (the number of iterations), i.e., $O(dnt)$.
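
The two alternating steps can be sketched for a one-dimensional mixture of Gaussians. The example
assumes unit initial variances, equal initial mixing weights, and a small lower bound on the
variance to avoid degenerate clusters; these choices and the function names are illustrative.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(data, means, iterations=50):
    """EM for a 1-D Gaussian mixture; `means` is the initial guess, one per cluster."""
    k = len(means)
    means = list(means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: membership probability of every item in every cluster.
        resp = []
        for x in data:
            dens = [w * gaussian_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate the parameter vector from the weighted items.
        for c in range(k):
            rc = sum(r[c] for r in resp)
            weights[c] = rc / len(data)
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / rc
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / rc
            variances[c] = max(var, 1e-6)        # guard against collapse
    return weights, means, variances

data = [0.9, 1.0, 1.1, 1.2, 4.8, 5.0, 5.1, 5.3]
weights, means, variances = em_gmm_1d(data, means=[0.0, 6.0])
```

Each iteration visits every item once per cluster, which is where the linear dependence on the
number of items and iterations comes from.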


Further Reading 

Algorithms: K-means (Chaturvedi et al. [2001], Hartigan and Wong [1979], MacQueen 
[1967]), Hierarchical Clustering (Chiu et al. [2001], Zhang et al. [1996]), EM (Dempster 
et al. [1977]), PROCLUS (Aggarwal et al. [1999]) and CLIQUE (Agrawal et al. [1998]); 
Packages: R (The R Foundation), Weka (Hall et al. [2009]), IBM SPSS and InfoSphere 
DataMining (IBM Corp. [c], IBM SPSS [2010a]), RapidMiner (Rapid-i), STATISTICA 
(The StatSoft Inc.), Oracle Data Miner (Oracle Corp. [b]) and SAS (SAS Institute Inc.); 
Applications: Han and Kamber [2006], Nisbet et al. [2009], Shmueli et al. [2010]. 





2.4 NEAREST NEIGHBOR SEARCH 


Nearest neighbor search is an optimization problem for finding the closest points in a metric
space. This problem of identifying points from an ensemble of points that are in some defined
proximity to a given query point has been applied for classification and clustering purposes in
multiple application domains such as distributed systems, image processing, data mining,
computational biology, data compression, and machine learning. The notion of proximity varies from
domain to domain and is usually formulated using a suitable metric function (e.g., Euclidean
distance for spatial proximity). For example, online media providers such as Netflix and Pandora
use nearest neighbor algorithms to suggest movies or songs that match the taste of a particular
user (Bell et al. [2010], Joyce [2006]). Nearest neighbor algorithms have also been used for
finding similarities in multimedia data (e.g., related images or videos) to detect copyright
violations (Boiman et al. [2008], Indyk [2004]). Other well-known applications of nearest neighbor
search include control systems, robotics, and drug discovery (Beyer et al. [1999], Jadbabaie et
al. [2003], Stanton et al. [1999]).


Basic Idea: Formally, the nearest neighbor problem can be defined as follows. Given a set $S$ of
$n$ points in some metric space $(X, d)$, the problem is to preprocess $S$ so that given a query
point $p \in X$, one can efficiently find a $q \in S$ that minimizes $d(p, q)$. In practice,
several variations of this definition are implemented depending on input data characteristics and
runtime constraints. Broadly, the existing sequential nearest neighbor solutions can be classified
by the dimensionality of the input data, the type of metric used in proximity calculations, result
cardinality (e.g., top-k and all-pairs), and data size (e.g., terabyte datasets).

The key aspect of any nearest neighbor algorithm is the metric function used for calculating the
proximity distance between the input data points. The most widely used metric in nearest neighbor
algorithms is the Euclidean distance: the distance between any two points $p_i$ and $p_j$ in a
$d$-dimensional space can be computed as
$\|p_i - p_j\| = \sqrt{\sum_{k=1}^{d} (p_{ik} - p_{jk})^2}$, where $p_{ik}$ is the $k$-th
component of the vector $p_i$. In practice, the Euclidean distance has proven to be effective for
low-dimensional data. For high-dimensional data, more generalized forms of metric distances are
employed (e.g., Hamming distance; Uhlmann [1991]). In the case of two-dimensional data, the
nearest neighbor problem can be solved by using Voronoi diagrams (Omohundro [1987]). As the
dimensionality becomes very high, the distance calculations become ineffective due to the curse
of dimensionality. For datasets with very high dimensionality (e.g., in computer vision), the
performance of these search algorithms degrades toward that of a linear scan. One approach to deal
with this inefficiency is to define an approximate version of the nearest neighbor problem: this
version identifies points from an ensemble of points whose distance from the given query point is
no more than $(1 + \epsilon)$ times the distance to the true $k$-th nearest neighbor.
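
As a baseline for the algorithms that follow, the exact problem can always be solved by a linear
scan. The sketch below computes Euclidean distances, finds the exact nearest neighbor, and checks
whether a candidate satisfies the $(1 + \epsilon)$ approximation condition for the single nearest
neighbor; all function names and data are illustrative.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two d-dimensional points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest_neighbor(points, query):
    """Exact nearest neighbor by a linear scan over all points."""
    return min(points, key=lambda p: euclidean(p, query))

def is_approximate_nn(points, query, candidate, eps):
    """True if `candidate` is within (1 + eps) times the true nearest distance."""
    exact = euclidean(nearest_neighbor(points, query), query)
    return euclidean(candidate, query) <= (1.0 + eps) * exact

points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
query = (2.0, 2.0)
best = nearest_neighbor(points, query)
```

Here the point (0, 0) is not the nearest neighbor of the query, but it is an acceptable answer to
the approximate problem for `eps = 0.3` and not for `eps = 0.1`.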


K-d Trees: The k-d tree algorithm addresses the exact nearest neighbor problem (Bentley [1975,
1980], Friedman et al. [1977]). The k-d tree is a binary tree used for representing
$k$-dimensional data using recursive hyperplane decomposition. Each node of the k-d tree
represents a region of the input dataset and its partitioning, and each level of the tree covers
the entire dataset. In $k$ dimensions, a record is represented by $k$ keys, where the $i$-th key
represents its position in the $i$-th dimension. The k-d tree is constructed by recursively
selecting one of the $k$ coordinates as the discriminator dimension and then partitioning the
dataset into subsets of vectors according to a certain partition value. During the query process,
the tree is recursively traversed from the root: at every level, the value of the discriminator
coordinate of the query record is compared against the partition value and either the left or
right path is chosen for further traversal. When the traversal reaches a leaf node, the query
record is compared with the records in the leaf, and a list of the $m$ closest records is
maintained. Overall, for a dataset of $N$ $k$-dimensional records, the tree search requires
$O(\log N)$ expected time, with $O(N)$ space consumption. The k-d tree is very effective for small
dimensionalities. As the number of dimensions increases, the quality of discrimination degrades as
the proximity calculations are based only on a subset of coordinates.
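
The construction and query procedures can be sketched as follows. This simplified variant makes
several assumptions that differ from the description above: it stores one record at every node
instead of buckets of records at the leaves, cycles through the coordinates as discriminators,
uses the median as the partition value, and keeps only the single closest record ($m = 1$).

```python
def build_kdtree(points, depth=0):
    """Recursively split on one coordinate (the discriminator) per level."""
    if not points:
        return None
    axis = depth % len(points[0])                # cycle through the k coordinates
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2                       # median as the partition value
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def sq_dist(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(node, query, best=None):
    """Exact nearest neighbor: descend toward the query, then backtrack if needed."""
    if node is None:
        return best
    if best is None or sq_dist(node["point"], query) < sq_dist(best, query):
        best = node["point"]
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, query, best)
    # Visit the other side only if the splitting plane is closer than the best match.
    if diff ** 2 < sq_dist(best, query):
        best = nearest(far, query, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
result = nearest(tree, (9, 2))
```

The backtracking test is what keeps the search exact: a subtree is skipped only when no point in
it can be closer than the best record found so far.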


Approximate Nearest Neighbor (ANN): The ANN algorithm uses hierarchical space decomposition for
solving the approximate nearest neighbor problem for low-dimensional data. The ANN algorithm
represents points in a $d$-dimensional space using a balanced box-decomposition (BBD) tree with
$O(\log n)$ depth (Arya et al. [1998]). ANN recursively subdivides the space into a collection of
cells, each of which is either an axis-aligned $d$-dimensional fat rectangle (i.e., one in which
the ratio between the longest and shortest sides is bounded) or the set-theoretic difference of
two rectangles, one enclosed within the other. Each node of the tree is associated with a cell and
thus with all points contained within that cell. Each leaf cell is associated with a single point
lying within the bounding rectangle of the cell. The leaves of the tree span the entire space. The
ANN tree has $O(n)$ nodes and can be built in $O(dn \log n)$ time. During the querying process,
for a given query point $q$, a priority queue of the internal nodes of the BBD-tree is created,
where the priority of a node is inversely related to the distance between the query point and the
cell corresponding to the node. The highest priority node is then selected for recursive descent
toward the leaves. As the descent progresses, the priority queue is updated appropriately. Let $p$
denote the closest point seen so far; as soon as the distance from $q$ to the current leaf exceeds
$d(q, p)/(1 + \epsilon)$, the search can be terminated.


Locality Sensitive Hashing (LSH): The locality-sensitive hashing (LSH) algorithms are designed
for solving the approximate nearest neighbor problem for very high-dimensional data sets (e.g.,
with a million feature vectors). The key idea behind the LSH algorithms is to hash points using
several hash functions chosen so that, for each function, the probability of collision is much
higher for points that are close to each other than for those that are far apart (Andoni and Indyk
[2008], Indyk and Motwani [1998]). Then, for a query point, one can determine its nearest
neighbors by hashing the query point and retrieving the points in the buckets containing that
point. The LSH method relies on a family of hash functions that have the property that if two
points are close, then they hash to the same bucket with high probability; if they are far apart,
they hash to the same bucket with low probability.
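
The bucket lookup can be sketched with one concrete hash family. The text above does not
prescribe a family; the example assumes random-hyperplane hashing, in which a point is hashed to
the sign pattern of a few random projections, so that points separated by a small angle tend to
collide. The number of tables, the number of bits, and all names are illustrative.

```python
import random

def make_hash(dim, bits, rng):
    """One hash function: the sign pattern of `bits` random projections."""
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]
    def signature(p):
        return tuple(sum(a * b for a, b in zip(plane, p)) >= 0.0 for plane in planes)
    return signature

def build_index(points, dim, tables=4, bits=6, seed=0):
    """Hash every point into one bucket per table."""
    rng = random.Random(seed)
    hashes = [make_hash(dim, bits, rng) for _ in range(tables)]
    buckets = [{} for _ in range(tables)]
    for p in points:
        for h, table in zip(hashes, buckets):
            table.setdefault(h(p), []).append(p)
    return hashes, buckets

def candidates(index, query):
    """Union of the buckets that the query falls into, one per hash table."""
    hashes, buckets = index
    found = set()
    for h, table in zip(hashes, buckets):
        found.update(table.get(h(query), []))
    return found

points = [(1.0, 2.0), (-1.0, -2.0), (2.0, -1.0)]
index = build_index(points, dim=2)
near = candidates(index, (2.0, 4.0))
```

The query (2, 4) points in the same direction as (1, 2), so the two always share a bucket under
this family, while the opposite point (-1, -2) receives the complementary signature. The
candidates returned would normally be re-ranked by their true distance to the query.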


Further Reading 

Basic idea: Omohundro [1987], Uhlmann [1991]; Algorithms: KD-Tree (Bentley [1975, 
1980], Friedman et al. [1977]), ANN (Arya et al. [1998]); LSH (Andoni and Indyk [2008], 
Datar et al. [2004], Indyk and Motwani [1998], Wang et al. [2014]), Ball Trees (Omohundro 
[1987, 1989, 1991]), Metric Trees (Ciaccia et al. [1997], Liu et al. [2006], Moore [2000]), 
Spill Trees (Liu et al. [2004, 2007]), and Cover Trees (Beygelzimer et al. [2006]); Packages: 
ANN (Mount and Arya [2010]); Applications: Bell et al. [2010], Beyer et al. [1999], Han 
and Kamber [2006], Jadbabaie et al. [2003], Joyce [2006], Nisbet et al. [2009], Stanton et al. 
[1999]. 





2.5 ASSOCIATION RULE MINING 


Association rule mining is a key data mining method used for discovering relationships between 
variables. Agrawal et al. [1993] first proposed using association rule mining for identifying rela- 
tionships between items purchased in retail stores, a process widely known as the market-basket 
analysis. Over the years, this method has been applied to more complex data patterns such as se- 
quences, trees, graphs, etc., and different application domains such as bio-informatics, intrusion 
detection, and web-usage analysis (Agrawal and Srikant [1994], Agrawal et al. [1993], wiki, Wu 
et al. [2008]). 


Basic Idea: Formally, association rule mining processes a set (or database) $D$ of transactions,
where each transaction $T$ is a set of items such that $T \subseteq I$, where
$I = \{i_1, i_2, \ldots, i_m\}$ is a set of literals, called items. Each transaction is associated
with a unique identifier, called TID. By association rule, we mean an implication of the form
$X \Rightarrow I_j$, where $X$ is a set of some items in $I$ and $I_j$ is a single item in $I$
that is not present in $X$. The rule $X \Rightarrow Y$ in the transaction set $D$ has a confidence
$c$ if $c\%$ of transactions in $D$ that contain $X$ also contain $Y$. The rule $X \Rightarrow Y$
has support $s$ in the transaction set $D$ if $s\%$ of transactions in $D$ contain $X \cup Y$
(Agrawal and Srikant [1994], Agrawal et al. [1993]). While confidence is a measure of the rule’s
strength, support corresponds to statistical significance. The support of a rule $X \Rightarrow Y$
is defined as $\mathrm{supp}(X \Rightarrow Y) = \mathrm{supp}(X \cup Y)$. The confidence of this
rule is defined as $\mathrm{conf}(X \Rightarrow Y) = \mathrm{supp}(X \cup Y)/\mathrm{supp}(X)$.
Given a set of transactions $D$, the problem of mining association rules is to generate all
association rules that have support and confidence greater than the user-specified minimum support
(called minsup) and minimum confidence (called minconf). The problem of discovering all
association rules can be decomposed into two subproblems (Agrawal and Srikant [1994], Agrawal et
al. [1993], Hipp et al. [2000]):


• Find all sets of items (itemsets) that have transaction support above minsup. The support
for an itemset is the number of transactions that contain the itemset. Itemsets with at least
minsup support are called large (frequent) itemsets, and all others are small itemsets.

• Use the frequent itemsets to discover the desired rules. For every frequent itemset $l$, we
find all non-empty subsets of $l$. Note that the support attribute follows the downward closure
property: all subsets of a frequent itemset must be frequent. For every such subset $a$, we
output a rule of the form $a \Rightarrow (l - a)$ if the ratio of support($l$) to support($a$) is
at least minconf. One needs to consider all subsets of $l$ to generate rules with multiple
consequents. The number of rules can grow exponentially with the number of items, but the choices
can be pruned using both minsup and minconf.


Once the association rules are computed, further pruning may be required to select the most
useful rules (Klemettinen et al. [1994]).
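
The second subproblem can be sketched as follows: given one frequent itemset, compute support
and confidence and emit every rule that meets minconf. Support is expressed here as a fraction of
transactions; the function names and the toy transactions are illustrative assumptions.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def rules_from_itemset(itemset, transactions, minconf):
    """Emit rules a => (l - a) for every non-empty proper subset a of l."""
    rules = []
    for size in range(1, len(itemset)):
        for a in combinations(sorted(itemset), size):
            a = frozenset(a)
            # Confidence is the ratio of support(l) to support(a).
            conf = support(itemset, transactions) / support(a, transactions)
            if conf >= minconf:
                rules.append((a, itemset - a, conf))
    return rules

transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]
frequent = frozenset({"bread", "milk"})
rules = rules_from_itemset(frequent, transactions, minconf=0.6)
```

In this data the itemset {bread, milk} has support 0.5, and both rules derived from it (bread
implies milk, and milk implies bread) have confidence 2/3.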

The key task in the association rule mining process is to find all itemsets that are frequent
with respect to a given minimal threshold minsup. All existing association rule mining algorithms
employ the downward closure property of the itemset support for pruning the search space: every
subset of a frequent itemset must be frequent. These algorithms can be classified by how the
search space is traversed to construct itemsets (Hipp et al. [2000]): the breadth-first search
(BFS) algorithms compute all itemsets of size $k - 1$ before building the itemsets of size $k$,
while the depth-first search (DFS) algorithms hierarchically compute all possible itemsets of size
$k$ from a list of frequent itemsets of size $j$ ($j < k$) before processing other itemsets of
size $j$.

Frequent and potentially frequent itemsets are called candidate itemsets. There are two common
ways of computing support values of these candidate itemsets. The first approach directly counts
occurrences of an itemset in all transactions of $D$. The second approach uses set intersection to
compute the support values of the itemsets. For every item in the transaction set, a list of
identifiers (TIDs) that correspond to the transactions containing that item is maintained (the
tidlist). Accordingly, tidlists also exist for every itemset $X$ and are denoted by $X.tidlist$.
The tidlist of a candidate $C = X \cup Y$ can be obtained as
$C.tidlist = X.tidlist \cap Y.tidlist$. The actual support of the itemset $C$ can then be computed
as $|C.tidlist|$.
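
The intersection-based approach can be sketched with Python sets standing in for tidlists. The
dictionary layout (TID mapped to a set of items) and the function names are illustrative
assumptions.

```python
def build_tidlists(transactions):
    """Map each item to the set of TIDs of the transactions that contain it."""
    tidlists = {}
    for tid, items in transactions.items():
        for item in items:
            tidlists.setdefault(item, set()).add(tid)
    return tidlists

def itemset_tidlist(itemset, tidlists):
    """tidlist of an itemset = intersection of the tidlists of its items."""
    items = list(itemset)
    result = set(tidlists.get(items[0], set()))
    for item in items[1:]:
        result &= tidlists.get(item, set())
    return result

transactions = {1: {"a", "b"}, 2: {"a", "c"}, 3: {"a", "b", "c"}, 4: {"b"}}
tidlists = build_tidlists(transactions)
c_tidlist = itemset_tidlist({"a", "b"}, tidlists)   # TIDs containing both a and b
support_count = len(c_tidlist)                      # support as a transaction count
```

Once the per-item tidlists are built, the support of any candidate is obtained without another
pass over the raw transactions.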

Based on the strategy for traversing the candidate search space (i.e., DFS vs. BFS) and for
computing the support values (i.e., direct counting vs. intersection), the association rule mining
algorithms can be partitioned into four key families (Hipp et al. [2000]) (Table 2.3). Both the
Apriori and Partition algorithms exploit the downward closure of itemsets by iteratively computing
large candidate itemsets using smaller sized candidate itemsets (Agrawal and Srikant [1994],
Savasere et al. [1995]). While the Apriori algorithm makes multiple passes over the raw data, the
Partition algorithm requires only two passes, as it divides the data into non-overlapping
partitions such that the itemsets to be evaluated for each partition fit into main memory. The
FP-growth algorithm uses depth-first traversal to build an extended prefix-tree structure, called
an FP-tree, that compactly represents groups of frequent items as paths (Han et al. [2000]). The
itemset mining algorithm then traverses the FP-tree to compute frequent patterns. The Eclat and
MaxClique family of algorithms views the creation of candidate itemsets as a lattice that
represents structural relationships among candidate itemsets, and uses two strategies, one based
on equivalence classes and another on maximal cliques in a hypergraph, to predict candidate
itemsets. These itemsets logically induce a sub-lattice that is then traversed depth-first to
generate all frequent itemsets (Zaki et al. [1997], Zaki [2000]).


Table 2.3: Classification of association rule mining algorithms

Traversal Method     | Support Computation | Algorithm
Breadth-first search | Direct counting     | Apriori
Breadth-first search | Intersection        | Partition
Depth-first search   | Direct counting     | FP-Growth
Depth-first search   | Intersection        | Eclat and MaxClique
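
As an illustration of the breadth-first, direct-counting family, the following is a minimal
sketch in the spirit of Apriori. It is not the published algorithm in full (for example, it uses
a simple pairwise join and no hash tree for counting); minsup is expressed as a transaction count,
and the function name and data are illustrative.

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Level-wise (breadth-first) search with direct counting of supports."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]      # candidate 1-itemsets
    frequent = {}
    k = 1
    while level:
        # Direct counting: one pass over the transactions per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        current = {c: n for c, n in counts.items() if n >= minsup_count}
        frequent.update(current)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any candidate
        # that has an infrequent k-subset (downward closure).
        k += 1
        joined = {a | b for a in current for b in current if len(a | b) == k}
        level = [c for c in joined
                 if all(frozenset(s) in current for s in combinations(c, k - 1))]
    return frequent

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
frequent = apriori(transactions, minsup_count=3)
```

With a minimum support count of 3, all single items and all pairs are frequent in this data,
while the triple {a, b, c} is generated as a candidate but rejected after counting.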


Further Reading

Basic idea: Agrawal and Srikant [1994], Agrawal et al. [1993], Hipp et al. [2000], Klemettinen
et al. [1994]; Algorithms: Apriori (Agrawal and Srikant [1994]), Partition (Savasere
et al. [1995]), FP-Growth (Han et al. [2000, 2004]), Eclat and MaxClique (Zaki et al.
[1997], Zaki [2000]); Packages: R (The R Foundation), Weka (Hall et al. [2009]), RapidMiner
(Rapid-i), STATISTICA (The StatSoft Inc.), SAS (SAS Institute Inc.), Microsoft
SQL Server (Microsoft Corp.), IBM InfoSphere DataMining (IBM Corp. [c]), Oracle Data
Miner (Oracle Corp. [b]), and SAP (Legler et al. [2006]); Applications: Davenport and Harris
[2007], Davenport et al. [2010], Han and Kamber [2006].





2.6 RECOMMENDER SYSTEMS


The goal of a recommender system is to predict the interest of users in a set of items and, based
on that interest, to provide meaningful recommendations. This method is widely used by a variety
of online e-commerce and media companies, e.g., book recommendations on Amazon, movie
recommendations by Netflix, or song recommendations from Spotify. Recommender systems differ from
association rule mining in that recommender systems make their selections based on external
information (e.g., collective ratings for a book from multiple readers or user- and item-specific
attribute knowledge), rather than on implication rules.


Basic Idea: A recommendation task can be defined as a process to predict the likely preferences
of a user and make recommendations that match these preferences. Broadly, recommendations can
be classified into the following categories: (1) non-personalized recommendations, which recommend
items based on the average rating given to the items by other customers; (2) attribute-based
recommendations, which recommend products based on product properties (e.g., genre of a book);
(3) item-to-item correlation, which recommends items based on the set of items in which the
customer has already shown interest; and (4) people-to-people correlation, which recommends items
based on the correlation between the customer and other customers that have purchased items
(Schafer et al. [1999]). In general, these recommendation methods can be classified into
Collaborative Filtering (CF), Content-based recommendation, and hybrid approaches that combine
both methods (Melville and Sidhwani [2010]).


Collaborative Filtering (CF): This approach uses social collaboration to aggregate ratings for 
items in a domain, and exploits similarities in the ratings for determining if an item can be rec- 
ommended. The CF methods can be further classified into neighborhood-based (or memory-based) 
and model-based approaches. 

In the neighborhood-based CF approach, a subset of related users (a neighborhood) is selected
based on their similarity to the given user, and their collective ratings are used to predict
ratings for the user. The most common measure of similarity is the Pearson correlation coefficient
between the ratings of two users. One can also treat the ratings of each user as a vector in an
$m$-dimensional space and compute similarity based on the cosine of the angle between two such
vectors. This approach can also be applied to item-to-item correlation where, rather than matching
similar users, a user’s rated items are matched to similar items.
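
A minimal sketch of neighborhood-based prediction follows. It computes both similarity measures
over the items that two users have in common, and predicts a rating as the similarity-weighted
average over the most similar users. Restricting the neighborhood to positively correlated users
is a simplifying assumption, and the names and ratings are illustrative.

```python
import math

def pearson(u, v):
    """Pearson correlation over the items that both users have rated."""
    common = [i for i in u if i in v]
    if not common:
        return 0.0
    mu = sum(u[i] for i in common) / len(common)
    mv = sum(v[i] for i in common) / len(common)
    num = sum((u[i] - mu) * (v[i] - mv) for i in common)
    den = math.sqrt(sum((u[i] - mu) ** 2 for i in common) *
                    sum((v[i] - mv) ** 2 for i in common))
    return num / den if den else 0.0

def cosine(u, v):
    """Cosine of the angle between the two rating vectors (common items only)."""
    common = [i for i in u if i in v]
    dot = sum(u[i] * v[i] for i in common)
    nu = math.sqrt(sum(u[i] ** 2 for i in common))
    nv = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (nu * nv) if nu and nv else 0.0

def predict(user, item, ratings, k=2):
    """Similarity-weighted average over the k most similar users who rated `item`."""
    peers = []
    for name, r in ratings.items():
        if name != user and item in r:
            s = pearson(ratings[user], r)
            if s > 0:                            # keep positively correlated users only
                peers.append((s, r[item]))
    peers = sorted(peers, reverse=True)[:k]
    total = sum(s for s, _ in peers)
    return sum(s * x for s, x in peers) / total if total else None

ratings = {
    "ann": {"m1": 5, "m2": 3, "m3": 4},
    "bob": {"m1": 4, "m2": 2, "m3": 3, "m4": 4},
    "cat": {"m1": 1, "m2": 5, "m3": 2, "m4": 1},
}
sim_ab = pearson(ratings["ann"], ratings["bob"])
predicted = predict("ann", "m4", ratings)
```

In this data the ratings of "ann" and "bob" differ only by a constant offset, so their Pearson
correlation is 1, whereas "cat" is negatively correlated with "ann" and is left out of the
neighborhood.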

The model-based CF approach formulates the CF problem as a statistical model of user ratings and
solves it by estimating its parameters. Current model-based algorithms use matrix factorization
models to detect similarities between users and items that are induced by some hidden (latent)
lower-dimensional structure present in the data. These models take as input an $n \times m$
ratings matrix $r$, each element $r_{ij}$ of which represents the rating of user $i$ for item $j$.
The ratings matrix is then factored into two matrices, $W$ and $H$, where
$W = [w_1, \ldots, w_n]^{T}$ is an $n \times k$ matrix, $H = [h_1, \ldots, h_m]$ is a
$k \times m$ matrix, and $k$ is the number of latent dimensions. The factorization learns
$k$-element feature vectors $w_i$, $h_j$ such that the inner products $w_i^{T} h_j$ approximate
the known preference ratings $r_{ij}$.

An alternative approach applies the non-negative matrix factorization (NNMF; Lee and
Seung [1999]) formulation to the recommendation problem. The non-negativity constraints on the
factor matrices lead to a part-based interpretation, where the ratings of each user can be viewed as
an additive sum of basis vectors of ratings in the item space. In practice, the matrix factorization is
implemented using the Alternating Least Squares (ALS) algorithm (Hu et al. [2008]). ALS is an
iterative algorithm that, starting from a fixed ratings matrix, alternately computes one of the
factor matrices (W or H) while keeping the other factor matrix constant. At every stage, the matrix
values are chosen to minimize the quadratic loss function defined as the sum of squared
differences (Zhou et al. [2008]).
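
The alternating update can be sketched in a few lines of Python. To keep each least-squares step in
closed form, this example is restricted to a single latent dimension (k = 1); a general
implementation would solve a small k-by-k linear system per user and per item. The function name,
the use of None for unknown ratings, and the regularization weight `lam` are assumptions of this
sketch, not details from the cited papers.

```python
def als_rank1(R, iters=100, lam=0.01):
    """Alternating least squares with one latent dimension (k = 1).

    R is a list of rows; None marks an unknown rating. Returns (w, h) such that
    w[i] * h[j] approximates the known ratings R[i][j].
    """
    n, m = len(R), len(R[0])
    w, h = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Keep h fixed and solve the regularized least-squares problem for each w[i].
        for i in range(n):
            obs = [j for j in range(m) if R[i][j] is not None]
            w[i] = sum(R[i][j] * h[j] for j in obs) / (sum(h[j] ** 2 for j in obs) + lam)
        # Keep w fixed and solve for each h[j].
        for j in range(m):
            obs = [i for i in range(n) if R[i][j] is not None]
            h[j] = sum(R[i][j] * w[i] for i in obs) / (sum(w[i] ** 2 for i in obs) + lam)
    return w, h

# Hypothetical rank-1 ratings with one unknown entry (row 2, column 1).
R = [[1, 2],
     [2, 4],
     [3, None]]
w, h = als_rank1(R, iters=200, lam=1e-9)
print(w[2] * h[1])  # estimate for the unknown rating
```

Only the observed entries contribute to each update, which is what lets the learned factors fill
in the missing rating.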


Content-based Recommendation: Unlike the collaborative filtering approach, which uses only the
collective ratings of individual users, content-based recommendations finalize their selections based
on representations of the content that interests the users and representations of the contents of the
items (e.g., movie genres).

In practice, this problem is commonly addressed using either similarity or classification
approaches on the associated textual content. In the similarity approach, the content related to the
user's preferences is viewed as a query, and the unrated documents related to the items are scored by
their relevance/similarity to this query. In the classification approach, the classification model is
trained using the user rating as a label and content attributes as features. For example, for book
recommendation, fields such as title, author, or subject are used to train a multinomial classifier
with K classes, where 1 to K is the scale of ratings (numerical ratings can also be used to train a
binary classifier). The classification model can then be implemented using a variety of traditional
classifiers such as the Naive Bayes classifier, k-nearest neighbor, or decision trees.


Hybrid Approaches: Hybrid approaches combine the collaborative and content-based recommenders
to take advantage of both approaches. A simple hybrid approach uses both approaches
to generate two separate rankings of recommendations and then merges the results to compute
the final list. The basic approach can be extended to weigh individual rankings, e.g., increase the
weight of the collaborative component as the number of users accessing an item increases.
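
A minimal sketch of such a weighted merge is shown below. It assumes that both recommenders
produce scores in [0, 1] and that the collaborative weight grows linearly with the number of raters
until a saturation point; the function name, the `saturation` parameter, and the linear weighting
rule are illustrative choices, not a prescribed scheme.

```python
def hybrid_rank(cf_scores, content_scores, n_raters, saturation=10):
    """Blend two score dictionaries (item -> score in [0, 1]) into one ranked item list.

    The weight of the collaborative score for an item grows with the number of
    users who rated it and is capped at 1.0 once `saturation` raters are reached.
    """
    blended = {}
    for item in set(cf_scores) | set(content_scores):
        alpha = min(1.0, n_raters.get(item, 0) / saturation)
        blended[item] = (alpha * cf_scores.get(item, 0.0)
                         + (1.0 - alpha) * content_scores.get(item, 0.0))
    # Highest blended score first; ties are broken by item name for determinism.
    return sorted(blended, key=lambda it: (-blended[it], it))

# Hypothetical scores: item "x" is widely rated, "y" and "z" are not rated yet.
cf = {"x": 0.9, "y": 0.2}
content = {"x": 0.1, "y": 0.8, "z": 0.5}
raters = {"x": 10, "y": 0}
print(hybrid_rank(cf, content, raters))
```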

One class of hybrid algorithms maintains content-based profiles of users and uses them for
finding similar users. A user-profile matrix is computed (as opposed to the user-ratings matrix)
and a collaborative filtering approach is directly applied to this matrix to suggest recommendations.
Another approach treats the recommendation process as a classification task, in which both
collaborative and content information are used together to create features for training a classifier
system. A related approach, content-boosted collaborative filtering, trains a Naive Bayes classifier
using documents that describe the rated items of each user and uses predictions from this classifier
to boost an existing sparse user ratings matrix. This pseudo rating matrix is then used to compute a
subset of users (neighbors) that have similar interests.

Further Reading

Basic idea: Adomavicius and Tuzhilin [2005], Koren et al. [2009], Su and Khoshgoftaar
[2009]; Algorithms: Collaborative Filtering (Breese et al. [1998], Goldberg et al. [1992], Hu
et al. [2008], Lee and Seung [1999], Linden et al. [2003], Su and Khoshgoftaar [2009], Zhou
et al. [2008]); Content-based Recommendation (Melville et al. [2002], Pazzani [1999], Pazzani
and Billsus [1997]); Hybrid Approaches (Basu et al. [1998], Claypool et al. [1999],
Good et al. [1999]); Applications: Bell et al. [2010], Bhasker and Srikumar [2010], Linden
et al. [2003].





2.7 SUPPORT VECTOR MACHINES 


Support Vector Machines (SVMs) are a family of supervised learning methods, primarily used
for classification and regression analysis (Bennett and Campbell [2000], Vapnik [1995, 1998]).
While this technique was originally designed for pattern recognition applications, it has found
applications in a wide spectrum of fields, e.g., astrophysics (parameter estimation, red-shift
detection), bio-informatics (e.g., gene classification), medical imaging (e.g., brain MRI processing),
text analytics (e.g., string-based text classification), time-series prediction (e.g., traffic modeling),
and financial modeling (e.g., stock indices behavior prediction) (Burges [1998], Guyon [2006],
Hsu et al. [2010], Noble [2006], Pereira et al. [2009], Rao et al. [2011], Sewell).


Basic Idea: An SVM is a class of algorithms that use kernel mapping approaches to map the
original data to a high-dimensional feature space and combine statistical learning approaches with
optimization techniques for classifying the input dataset. A classification application usually
involves operating on two types of data sets: one for training and another for testing. Each instance
of the training set contains one target value (i.e., a class label) and several attributes (i.e.,
features). The goal of the SVM is to produce a model based on the training data that predicts the
target values of the test data, given only the test data attributes. The standard form of SVM
classifies the test data into two categories. Intuitively, each data point in the training set is first
mapped to a high-dimensional space, which is then partitioned by a set of hyperplanes constructed by
the machine. The optimal hyperplanes provide maximum separation (margin) between the data
points (Boser et al. [1992], Cortes and Vapnik [1995]).


Core Algorithms: The error rate of an SVM on the test data (also called the generalization error)
depends on the accuracy in learning a particular training set and its capacity, i.e., its ability to learn
any training set without error. All SVM algorithms are designed such that there are zero errors in
learning a particular training set, and the capacity is optimized, i.e., errors in learning any training
set are minimized. It has been proven that optimal hyperplanes that provide maximum separation
between data points in a high-dimensional space lead to improved generalization (Vapnik [1998]).
Further, to construct such optimal hyperplanes, one needs to use only a subset of the training data,
called the support vectors.

Consider the linearly separable binary classification scenario. We have $l$ training points,
where each input $x_i$ has $d$ features and is in one of the two classes $y_i = -1$ or $+1$, i.e., the
training set has the following form: $\{x_i, y_i\}$, $i = 1, \ldots, l$, $y_i \in \{1, -1\}$,
$x_i \in \mathbb{R}^d$. Since the data is linearly separable, one can partition the data into two classes
using a hyperplane $x \cdot w + b = 0$, where $w$ is normal to the hyperplane and $|b| / \|w\|$ is the
perpendicular distance from the hyperplane to the origin. Let $d_+$ and $d_-$ be the shortest distances
from the separating hyperplane to the closest training examples (i.e., support vectors) of each class.
For the linearly separable case, the support vector algorithm simply looks for a separating hyperplane
with the maximum margin, $d_+ + d_-$. This can be formulated as follows: suppose the training data set
satisfies the following constraints:

$$x_i \cdot w + b \geq +1, \quad \text{for } y_i = +1 \tag{2.4}$$

$$x_i \cdot w + b \leq -1, \quad \text{for } y_i = -1. \tag{2.5}$$

These can be combined into one set of inequalities:

$$y_i (x_i \cdot w + b) - 1 \geq 0 \quad \forall i. \tag{2.6}$$

The points (support vectors) for which the equalities in Equations 2.4 and 2.5 hold lie
on two hyperplanes with $d_+ = d_- = 1 / \|w\|$, and the margin is $2 / \|w\|$. Maximizing the margin
subject to the constraint in Equation 2.6 is equivalent to finding:

$$\min \|w\| \quad \text{s.t.} \quad y_i (x_i \cdot w + b) - 1 \geq 0 \quad \forall i. \tag{2.7}$$

Minimizing $\|w\|$ is equivalent to minimizing $\frac{1}{2} \|w\|^2$. Thus, Equation 2.7 can be
rewritten as

$$\min \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i (w^T \phi(x_i) + b) - 1 \geq 0 \quad \forall i, \tag{2.8}$$

where $\phi$ is a function that maps training data $x_i$ into a higher dimensional space. Equation 2.8 can
be formulated using Lagrange multipliers (Fletcher [2009]) and solved using the Quadratic or
Linear Programming (QP/LP) approach to compute values of $w$ and $b$. Existing general-purpose
QP algorithms like the quasi-Newton or primal-dual interior point methods are usually used for
small-sized problems. For larger problems, LP solvers based on simplex or interior-point methods
can be used (Section 2.13).
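
The constraints and the margin can be checked numerically for a candidate hyperplane. The Python
sketch below does not solve the QP; it only evaluates $y_i (x_i \cdot w + b)$ for each training point,
reports the margin $2 / \|w\|$, and lists the points that meet the constraint with equality (the
support vectors). The function names, the tolerance, and the toy data are assumptions made for
illustration.

```python
from math import sqrt

def functional_margins(X, y, w, b):
    """Return y_i * (x_i . w + b) for every training point."""
    return [yi * (sum(wk * xk for wk, xk in zip(w, xi)) + b) for xi, yi in zip(X, y)]

def is_feasible(X, y, w, b, tol=1e-9):
    """True if every point satisfies y_i * (x_i . w + b) - 1 >= 0 (Equation 2.6)."""
    return all(m >= 1.0 - tol for m in functional_margins(X, y, w, b))

def margin_width(w):
    """Margin d+ + d- = 2 / ||w|| between the two bounding hyperplanes."""
    return 2.0 / sqrt(sum(wk * wk for wk in w))

def support_vectors(X, y, w, b, tol=1e-9):
    """Points for which the constraint holds with equality."""
    margins = functional_margins(X, y, w, b)
    return [xi for xi, m in zip(X, margins) if abs(m - 1.0) <= tol]

# Hypothetical 2-D training set and a candidate separating hyperplane.
X = [(1.0, 1.0), (2.0, 2.0), (-1.0, -1.0)]
y = [+1, +1, -1]
w, b = (0.5, 0.5), 0.0

print(is_feasible(X, y, w, b), margin_width(w), support_vectors(X, y, w, b))
```

For this toy data the margin equals the distance between the two closest points of opposite
classes, which is what a maximum-margin hyperplane is expected to achieve.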

For data that is not linearly separable, a kernel function, $K(x_i, x_j)$, is used for mapping it
to a higher dimensional space. Most current SVM algorithms use one of the following basic kernel
functions:


• Linear: $K(x_i, x_j) = x_i^T x_j$


• Polynomial: $K(x_i, x_j) = (\gamma x_i^T x_j + r)^d$, $\gamma > 0$


• Radial Basis Function (RBF): $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$, $\gamma > 0$


• Sigmoid: $K(x_i, x_j) = \tanh(\gamma x_i^T x_j + r)$
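
The four kernels listed above translate directly into code. The following Python sketch uses plain
tuples as vectors; the function names and the default values of `gamma`, `r`, and `d` are
assumptions chosen for the example rather than recommended settings.

```python
from math import exp, tanh

def dot(a, b):
    """Inner product x_i^T x_j of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def linear_kernel(a, b):
    return dot(a, b)

def polynomial_kernel(a, b, gamma=1.0, r=0.0, d=2):
    return (gamma * dot(a, b) + r) ** d

def rbf_kernel(a, b, gamma=1.0):
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return exp(-gamma * squared_distance)

def sigmoid_kernel(a, b, gamma=1.0, r=0.0):
    return tanh(gamma * dot(a, b) + r)

xi, xj = (1.0, 2.0), (3.0, 4.0)
print(linear_kernel(xi, xj), polynomial_kernel(xi, xj, r=1.0), rbf_kernel(xi, xj, gamma=0.5))
```

Each function returns the inner product of the two inputs in the mapped feature space without
constructing that space explicitly.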


Further Reading 

Basic idea: Bennett and Campbell [2000], Boser et al. [1992], Burges [1998], Cortes and 
Vapnik [1995], Vapnik [1995, 1998]; Algorithms: Lodhi et al. [2002], Loosli and Canu 
[2007], Tsang et al. [2005]; Packages: R (The R Foundation), Weka (Hall et al. [2009]), IBM
SPSS (IBM SPSS [2010a]), RapidMiner (Rapid-i), STATISTICA (The StatSoft Inc.), Or- 
acle Data Miner (Oracle Corp. [b]), and SAS (SAS Institute Inc.); Applications: Guyon 
[2006], Hsu et al. [2010], Lodhi et al. [2002], Noble [2006], Rao et al. [2011], Sewell. 





2.8 NEURAL NETWORKS 


An artificial neural network is a system that is inspired by the biological network of neurons in
the brain and uses a mathematical or computational model for information processing based on a
connectionist approach. A neural network can be viewed as a massively parallel distributed system
that has a natural propensity for storing knowledge and mimics the brain in two respects: a neural
network acquires knowledge through a learning process and stores it using inter-neuron connection
strengths, represented using synaptic weights (Sarle [2002], Stergiou and Siganos).

Neural networks can be considered as nonlinear statistical data modeling or decision-making
tools. In practice, they are used to model complex relationships between system input
and output to infer results for novel inputs or to find patterns in data. Broadly, tasks to which
neural networks are applied can be classified into: function approximation, classification, data
mining, inferencing, and cognitive modeling. Neural networks have been applied to a diverse set
of domains, e.g., games (e.g., Go, Chess), music, material science, weather forecasting, medicine,
chemistry, pattern recognition and classification (e.g., image, speech, or character), the financial
industry (e.g., analyzing stock trends), online fraud detection, and many more (Kriesel [2007], Sarle
[2002], Widrow et al. [1994]).


Basic Idea: The most common neural network designs are based on the biological systems. In 
a biological network, neurons are linked to each other via weighted edges and when stimulated, 
they electrically transmit their signals via connecting axons. These signals get modified before 
reaching the destination neuron. A neuron gets multiple inputs that have been pre-processed and 
accumulated into a single pulse. A neuron, upon stimulation, may or may not emit a pulse. The 
output may be nonlinear and may not be proportional to the accumulated input (Kriesel [2007]). 

Thus, a neural network implementation assumes that a neuron receives a vector input, $x$,
that is weighted and accumulated to a scalar value as a weighted sum, $\sum_i w_i x_i$, before it is
transmitted to the receiver neuron. The weighted sum is an example of the propagation function. The
set of such weights represents the information storage of a neural network. The output of a neuron may
not be proportional to the input (i.e., the response $y$ is nonlinear, $y = f(\sum_i w_i x_i)$). The
neural output is determined by its activation function $f()$. Multiple scalar outputs from different
neurons in turn form the vector input of another neuron. Finally, the weights used in weighting the
inputs are variables that capture the chemical processes in neurons.
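
The neuron model described above amounts to a weighted sum followed by an activation function. A
minimal Python sketch, assuming tanh as the default activation and hypothetical function names
`neuron` and `layer`, might look as follows.

```python
from math import tanh

def neuron(x, w, f=tanh):
    """Output y = f(sum_i w_i * x_i) of a single neuron."""
    net = sum(wi * xi for wi, xi in zip(w, x))  # propagation function (weighted sum)
    return f(net)                               # activation function

def layer(x, W, f=tanh):
    """Outputs of several neurons that share the same input vector x.

    W holds one weight vector per neuron; the returned list can serve as the
    vector input of the neurons in a following layer.
    """
    return [neuron(x, w, f) for w in W]

def step(s):
    """Binary threshold activation."""
    return 1.0 if s > 0 else 0.0

print(layer([1.0, 1.0], [[1.0, 1.0], [1.0, -1.0]], f=step))
```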

Neural networks can be classified by: (1) underlying network topology, (2) type of learning
algorithm used, and (3) type of input data. There are three major kinds of network topologies.


1. Feedforward networks: Feedforward networks consist of layers of neurons with connections
to any one of the next layers. The neurons are grouped into the following layers: input layer,
n hidden layers (invisible from outside), and output layer. These connections do not form
any cycles.


2. Feedback networks: In feedback or recurrent networks, the state of a neuron at one time can
influence its state at a future time. Some feedback networks allow direct cycles, in which a
neuron is connected to itself; others only allow indirect cycles, where a neuron A acts as an
input to neuron B and, in turn, neuron B is an input of neuron A.


3. Completely linked networks: Completely linked networks permit connections between all 
neurons, except for direct recurrences. Furthermore, the connections need to be symmetric. 


Neural networks are characterized by their capability to familiarize themselves with problems
by means of training and, after sufficient training, to solve unknown problems of the same
class (Kriesel [2007]). Neural networks learn by using a set of training patterns and modify their
connecting weights as per certain rules (e.g., the Hebbian Rule; Hebb [1949]). There are three
main types of learning schemes.


1. Unsupervised learning: In this approach, the training set consists only of input patterns, and the
neural network tries by itself to detect similar patterns and classify them into pattern classes.


2. Reinforcement learning: In reinforcement learning, after completion of a training sequence,
the network receives a response that specifies whether the result was right or wrong and, if
possible, how right or wrong it was.


3. Supervised learning: In supervised learning, the training set consists of input patterns and
their correct results in the form of activations from all output neurons. The objective is to change
the weights so that not only do the outputs match the values in the training set, but for
unknown, similar patterns, the network also produces plausible results.

Finally, neural networks are characterized by the kinds of input data. The most common
kinds of data are categorical and quantitative. The categorical variables take only a finite num- 
ber of possible values in different classes or categories, while the quantitative variables represent 
numerical measurements of some attributes. Learning with categorical values can be viewed as 
classification, while supervised learning with quantitative values is viewed as regression. 

The most common form of neural network is called a perceptron, which is a feed-forward
network with one input neuron layer connected to one or more trainable weight layers. A
single-layer perceptron (SLP) is a perceptron with an input layer and only one trainable weight layer.
Neurons in the weight layer use a variety of activation functions (e.g., binary threshold, hyperbolic
tangent, or weighted sum). An SLP with binary output is considered a linear classifier that maps its
real-valued input vector x to a binary output value f(x). A perceptron with two or more trainable
weight layers is called a multi-layer perceptron (MLP). An n-stage perceptron has n variable
weight layers and n + 1 neuron layers, the first layer being the input layer, and n - 1 hidden
layers. The hidden layers usually have nonlinear differentiable functions, e.g., logistic, softmax,
and Gaussian. Multi-layer perceptrons are usually trained using a supervised learning algorithm
called back-propagation. The back-propagation algorithm involves two steps: (1) propagation, which
involves forward propagation of the input through the neural network and backward propagation of
the output activations to generate the delta errors; and (2) use of the delta errors and input values
to calculate the gradient of the network error. The gradient is then used in a simple stochastic
gradient descent algorithm to find weights that minimize the errors. This algorithm is basically
a generalization of the delta rule that is used for single-layer perceptrons. First, the derivative
of the error function with respect to the network weights is calculated, and then the weights are
modified such that the error decreases. For this reason, back-propagation can only be applied to
nodes with differentiable activation functions.
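
The two steps can be made concrete on a very small network. The Python sketch below assumes one
tanh hidden layer, a single linear output neuron, no bias terms, and a squared-error loss for one
training pattern; these choices, the function names, and the learning rate are simplifications made
for the example and are not prescribed by the text.

```python
from math import tanh

def forward(x, W1, W2):
    """Forward propagation: tanh hidden layer followed by one linear output neuron."""
    hidden = [tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    out = sum(w * h for w, h in zip(W2, hidden))
    return hidden, out

def loss(x, t, W1, W2):
    """Squared error 0.5 * (output - target)^2 for one training pattern."""
    return 0.5 * (forward(x, W1, W2)[1] - t) ** 2

def backprop(x, t, W1, W2):
    """Return the gradients of the loss with respect to W1 and W2."""
    hidden, out = forward(x, W1, W2)
    delta_out = out - t                               # delta error at the output
    grad_W2 = [delta_out * h for h in hidden]
    # Propagate the output delta backward through W2 and tanh' = 1 - h^2.
    delta_hidden = [delta_out * w * (1.0 - h * h) for w, h in zip(W2, hidden)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hidden]
    return grad_W1, grad_W2

def sgd_step(x, t, W1, W2, lr=0.01):
    """One stochastic gradient descent update; returns new weight structures."""
    g1, g2 = backprop(x, t, W1, W2)
    new_W1 = [[w - lr * g for w, g in zip(row, grow)] for row, grow in zip(W1, g1)]
    new_W2 = [w - lr * g for w, g in zip(W2, g2)]
    return new_W1, new_W2

# Hypothetical weights and a single training pattern.
W1 = [[0.1, -0.2], [0.3, 0.4]]
W2 = [0.5, -0.6]
x, t = [1.0, 2.0], 1.0
before = loss(x, t, W1, W2)
W1, W2 = sgd_step(x, t, W1, W2)
print(before, loss(x, t, W1, W2))
```

The factor `1.0 - h * h` is the derivative of tanh, which is where the requirement of
differentiable activation functions enters the computation.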

Other important classes of neural networks include recurrent networks, in which connections
between neurons form directed cycles. Such networks are able to influence themselves
by means of recurrences, using network outputs in the following computation steps (Hopfield
[1982]). Another example is convolutional neural networks (CNNs; Fukushima [2013]), which
are biologically inspired variants of MLPs. CNNs mimic operations of the visual cortex by exploiting
spatially local correlations (LeCun et al. [1998]). CNNs are usually organized into layers of
two types: a convolution layer and a pooling (sub-sampling) layer, which computes the max or
average value of a particular feature over a region of input data. Convolution and pooling enable the
neural network to train in a translation-invariant manner (LeCun and Bengio [1995]). Recently,
CNNs have gained a lot of attention as they are used extensively in deep learning systems: usually,
multi-stage neural networks that use CNNs as the first layers, connected to follow-on
layers of traditional MLPs (Bengio [2009], Bengio et al. [2014], Schmidhuber [2014]). Deep
learning systems perform multiple nonlinear feature transformations over multiple stages.
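
The two CNN layer types can be illustrated in one dimension. In the Python sketch below, the
convolution layer is written as a sliding dot product without padding (the form commonly used in
CNN implementations), and the pooling layer reduces non-overlapping regions with max or average.
The function names, the edge-detecting kernel, and the input signal are assumptions of this example.

```python
def conv1d(signal, kernel):
    """Sliding dot product of the kernel over the signal (no padding, stride 1)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def pool1d(features, size, op=max):
    """Reduce each non-overlapping region of `size` features with `op` (e.g., max)."""
    return [op(features[i:i + size]) for i in range(0, len(features) - size + 1, size)]

def average(window):
    return sum(window) / len(window)

# Hypothetical signal with two rising edges; the kernel responds to a rise of 0 -> 1.
signal = [0, 0, 1, 1, 0, 0, 1, 1, 0]
features = conv1d(signal, [-1, 1])
print(features)
print(pool1d(features, 2))            # max pooling
print(pool1d(features, 2, average))   # average pooling
```

Because the same kernel is applied at every position, a feature is detected wherever it occurs,
and pooling makes the response less sensitive to its exact location.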

Further Reading

Basic ideas: Hebb [1949], Kriesel [2007]; Algorithms: RBF networks (Kriesel [2007]),
Hopfield networks (Hopfield [1982]), Elman and Jordan networks (Elman [1990], Jordan
[1986]), Kohonen networks (Kohonen [1997], Kohonen and Honkela [2007], Sarle
[2002]); Packages: R (The R Foundation), Weka (Hall et al. [2009]), IBM SPSS, InfoSphere
DataMining (IBM Corp. [c], IBM SPSS [2010a]), RapidMiner (Rapid-i), STATISTICA
(The StatSoft Inc.), and SAS (SAS Institute Inc.); Applications: Kriesel [2007], Sarle
[2002], Stergiou and Siganos, Widrow et al. [1994].





2.9 DECISION TREE LEARNING 


Decision tree learning covers a class of algorithms that use a tree-based model (usually referred 
to as a decision tree) to represent decisions and their possible consequences (Shih [2004]). Intu- 
itively, a decision tree is an encoding of all possible outcomes for a given problem scenario anno- 
tated with their conditional probabilities. Important applications of these algorithms include mar- 
keting, fraud detection, medical diagnostics, agriculture, and manufacturing/production (Murthy 
[1998]). For example, decision trees are used to determine if a potential customer should get a 
loan or to finalize treatment for a cancer patient. In these scenarios, the decision trees are used 
for classification in which the classifier model is used to predict values of categorical variables 
(e.g., yes or no) or continuous variables (e.g., amount of money a particular customer is willing to
spend).


Basic Idea: Data classification is a two-stage process: in the first stage, a classifier model is built
using a supervised learning process, and in the second stage the model is used to predict decisions
for the unknown user inputs (i.e., for inputs that are not used in the training phase).
The supervised learning phase uses a set of class-labeled training samples, where each sample is
an n-dimensional feature vector and is associated with the corresponding class-label attribute. The
class-label attribute can be either categorical or continuous. Each distinct value of the class-label
attribute defines a distinct partition or class of the data set. The result of the learning phase is a
decision tree whose internal nodes represent conjunctions of feature predicates and whose leaves
represent classifications. After the model has been trained, it is tested for accuracy using a new set of
testing samples. Once the model's accuracy has been validated, it is ready for general data sets.

Most algorithms build the decision trees top-down by iteratively splitting the data set using
a feature attribute at each step as the splitting parameter. Thus, as one traverses down the tree,
the partitioning becomes more refined. One of the key factors affecting the performance of any
decision tree algorithm is the selection of relevant feature attributes. Some features may be
statistically correlated, thus redundant, and only one of these features is used for splitting. Further,
some features may be irrelevant and can be completely eliminated during the splitting process.
The reduced set of attributes can then generate a probability distribution of classes that is very
similar to the original data set.

Thus, it is very important to select the feature predicates that can generate the best
partitioning of the class-labeled data set into classes. The ideal partitioning would create distinct
classes, each of which would contain vectors that have the same values for the feature attributes used
in the data set splitting. Conceptually, the best partitioning matches the ideal scenario as closely as
possible. Most decision tree algorithms use heuristics called attribute selection measures or splitting
rules to select the feature predicates. The most popular attribute selection measures are: Information
Gain, Information Gain Ratio, and Gini Index (Han and Kamber [2006], Lim et al. [2000],
Loh and Shih [1997]). We now discuss key decision tree algorithms.


ID3/C4.5: ID3 (Iterative Dichotomiser) and C4.5 are two decision tree algorithms proposed
by J. Ross Quinlan that use entropy-based attribute selection measures (Kotsiantis [2007],
Quinlan [1986, 1993]). Both algorithms use a greedy approach to build a decision tree in a top-down
recursive divide-and-conquer manner using a class-labeled training set. Both ID3 and C4.5
algorithms require three parameters: the training set, D, the set of attributes of the training vectors,
and the attribute selection heuristic. The ID3 algorithm uses the information gain measure for
attribute selection, while the C4.5 algorithm uses the information gain ratio.
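
Both measures are based on the entropy of the class labels before and after a split. The Python
sketch below assumes the standard textbook definitions (information gain as the reduction in
entropy, gain ratio as information gain divided by the entropy of the split itself); the row
representation as dictionaries and the function names are choices made for this example.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def partition(rows, labels, attr):
    """Group the class labels by the value each row takes for attribute `attr`."""
    parts = {}
    for row, label in zip(rows, labels):
        parts.setdefault(row[attr], []).append(label)
    return parts

def information_gain(rows, labels, attr):
    """Entropy of the labels minus the weighted entropy after splitting on attr."""
    n = len(labels)
    parts = partition(rows, labels, attr)
    remainder = sum(len(p) / n * entropy(p) for p in parts.values())
    return entropy(labels) - remainder

def gain_ratio(rows, labels, attr):
    """Information gain normalized by the entropy of the split itself."""
    split_info = entropy([row[attr] for row in rows])
    return information_gain(rows, labels, attr) / split_info if split_info > 0 else 0.0

# Hypothetical training set: attribute "a" determines the class, "b" is irrelevant.
rows = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1}]
labels = ["no", "no", "yes", "yes"]
best = max(["a", "b"], key=lambda attr: information_gain(rows, labels, attr))
print(best)
```

A greedy tree builder would split on the attribute selected this way and then repeat the
selection on each resulting partition.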

Both algorithms build the tree starting with the root node N, which represents the entire
training dataset D. If all vectors in D fall in the same class, the node N is considered a leaf
node, and the process terminates. Otherwise, the algorithm uses the chosen attribute measure
to determine the attribute that will be used for partitioning the dataset. The node N is labeled with
the splitting criterion that serves as the partitioning test for that node. For every outcome of the
splitting criterion, a branch is grown from the node N. The dataset gets partitioned as per the
distinct values of the attribute. The process terminates when all attributes have been used for
partitioning or the vectors fall into the same class. Decision trees generated by ID3 or C4.5 suffer
from the problem of over-fitting to noisy or outlier data. To address this problem, the tree is
pruned to remove the least reliable branches after the fully grown tree has been built (called
postpruning). The C4.5 algorithm uses an approach called pessimistic pruning (Han and Kamber
[2006]) that uses the training dataset to determine the pruning strategy.


CART: Classification and Regression Trees (CART or C&RT; Breiman et al. [1984]) is a family
of nonparametric recursive tree-building algorithms for predicting continuous dependent variables
(regression) and categorical dependent variables (classification). Like ID3/C4.5, CART builds
the tree top-down by recursively partitioning the dataset. However, unlike the C4.5 algorithm,
CART builds binary trees.

While building the tree, CART uses two different measures for identifying the splitting
attribute. For regression problems, a least-squares deviation criterion is used, while for categorical
variables, impurity measures like the Gini index are employed. Once the splitting attribute is
identified, the dataset is partitioned into two groups. The process continues until the stopping
conditions are satisfied. The CART algorithms also use a postpruning approach to manage the
resultant tree size. CART uses cost complexity to determine which part of the tree needs to be
pruned (Han and Kamber [2006], Wu et al. [2008]). This postpruning approach generates a set
of pruned trees; eventually, the tree that minimizes the cost complexity is selected.
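
For a categorical target, the binary split search can be sketched with the Gini index. The Python
example below scans candidate thresholds on a single numeric attribute and keeps the one with the
lowest weighted impurity of the two resulting groups. The function names and the exhaustive
threshold scan are assumptions of this sketch; pruning and regression splits are not covered.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_binary_split(values, labels):
    """Return (threshold, weighted Gini) of the best split 'value <= threshold'."""
    n = len(labels)
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:          # the largest value cannot split the data
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical attribute values and class labels.
values = [1, 2, 3, 10, 11, 12]
labels = ["a", "a", "a", "b", "b", "b"]
print(best_binary_split(values, labels))
```

Applying the same search recursively to the left and right groups yields the binary tree
structure that distinguishes CART from multi-way splitting algorithms.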


CHAID: The CHAID (CHI-square Automatic Interaction Detector) algorithm also uses a recursive
tree-building process that partitions the dataset as it builds the tree (Kass [1980], Neville
[1999]). Unlike the binary tree created by the CART algorithm, CHAID creates a wide tree with
multiple branches. As the CHAID algorithm can represent multiple categories effectively, it has
been widely applied for market segmentation analysis. The CHAID algorithm uses the Pearson
chi-square test as the splitting criterion for ordinal and categorical variables, and for continuous
variables it uses F-tests.

The CHAID algorithm first prepares the predictor variables by dividing continuous distributions
into a number of categories (binning). Internally, the CHAID algorithm only uses categorical
variables. Once the initial categories are determined, the algorithm uses the chi-squared
or F-tests to determine the statistical independence/significance of the data (Bonferroni-adjusted
p-value). This information is used for determining the number of branches at an internal tree node. If
the significance level is below a certain threshold, the branches are merged. Alternatively, a branch
is split into two. The process terminates when there are no more significant splits or merges. In
the CHAID algorithm, the last split determines the partitioning of the input dataset. A version
of CHAID, called the exhaustive CHAID, chooses a partitioning that corresponds to the most
significant split.
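
The statistic behind the merge decision is the Pearson chi-square computed on a contingency table
of predictor categories against target classes. The Python sketch below computes the statistic and
picks the pair of categories whose class distributions differ least, which would be the natural
candidate for merging. Converting the statistic to a Bonferroni-adjusted p-value requires the
chi-square distribution and is omitted; the function names and the example counts are hypothetical.

```python
from itertools import combinations

def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows of counts).

    Assumes every row total and every column total is positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

def most_similar_pair(counts):
    """Pair of categories with the smallest pairwise chi-square statistic.

    `counts` maps each predictor category to its list of target-class counts.
    """
    return min(combinations(sorted(counts), 2),
               key=lambda pair: chi_square([counts[pair[0]], counts[pair[1]]]))

# Hypothetical counts of (responded, did not respond) per income category.
counts = {"low": [30, 10], "mid": [28, 12], "high": [5, 35]}
print(most_similar_pair(counts))
```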


Further Reading 

Basic idea: Han and Kamber [2006], Lim et al. [2000], Loh and Shih [1997], Quinlan 
[1986, 1993]; Algorithms: ID3/C4.5 (Kotsiantis [2007], Quinlan [1986, 1993]), C&RT 
(Breiman et al. [1984], Han and Kamber [2006]), CHAID (Kass [1980], Neville [1999]), 
and QUEST (Lim et al. [2000], Loh and Shih [1997], Shih [2004]); Packages: R (The 
R Foundation), Weka (Hall et al. [2009]), IBM SPSS (IBM SPSS [2010a]), RapidMiner 
(Rapid-i), STATISTICA (The StatSoft Inc.), Oracle Data Miner (Oracle Corp. [b]), and 
SAS (SAS Institute Inc.); Applications: Han and Kamber [2006], Murthy [1998], Nisbet 
et al. [2009], Shmueli et al. [2010]. 





2.10 TIME SERIES PROCESSING 


A time series is a sequence of observations reported according to the time of their outcome (Falk
et al. [2006]). Examples of time series data are prevalent in everyday life: prices of commodities
during a trading day, daily opening and closing of stock market indexes, hourly weather reports,
utility consumption charts, etc. Other important application domains that use time series data
include geology, economics, control systems, medical informatics, process engineering, and social
sciences (Croarkin and Tobias [2011], Shumway and Stoffer [2010]). Study of the time series
data is targeted to achieve one of two goals: (1) understand the basic characteristics of the
observed data (Analysis); and (2) fit a model to the observed data set and apply it for forecasting
based on known past values (Forecasting). Both goals require the pattern for the time series
to be identified and modeled.


Basic Idea: Analyzing a time series is different from traditional data analysis as the data is
not generated independently, the dispersion of data items varies in time, the data is often governed
by a trend, and it can have cyclic components (Falk et al. [2006]). Thus, statistical approaches that
assume independent and identically distributed data do not apply to time series. Time
series data inherently exhibits temporal ordering. In a time series, observations taken closer in
time are more related than observations taken further apart. Further, an observation at a given
time can potentially be derived from past observations, rather than from future observations.

In general, a time series $y_1, \ldots, y_n$ can be viewed as a sequence of random variables $y_t$ that
individually can be decomposed into four components:

$$y_t = T_t + Z_t + S_t + R_t, \quad t = 1, \ldots, n, \tag{2.9}$$

where $T_t$ is a monotone function of $t$, called the trend, $Z_t$ and $S_t$ reflect long- and short-term
non-random cyclic influences, called seasonality, and $R_t$ represents a random variable capturing errors
(noise) from the ideal non-stochastic model $y_t = T_t + Z_t + S_t$ (Falk et al. [2006], StatSoft Inc.
[2010]). Analysis of any time series data involves identifying underlying trends and seasonalities.
Time series analysis can be carried out in either the time or the frequency domain.


Trend Analysis: In the time domain, trend analysis identifies the trend component of a time
series. If the error component in the time series is significant, then the data needs to be
pre-processed, or smoothed. The smoothing process involves some form of local averaging of data such
that the irregular components of the individual observations cancel each other out. The most
common technique is moving average smoothing, which replaces each element of the series
by either the simple or the weighted average of n surrounding elements, where n is the width of the
smoothing window (ARIMA; Box and Jenkins [1976]). The smoothing process can use medians
instead of means: medians can reduce the effects of outliers, but in the absence of outliers, they
can produce jagged curves. Median smoothing also does not allow weighting during the smoothing process.
In cases where the random errors are dominant, the smoothing process can use distance weighted
least squares smoothing or negative exponentially weighted smoothing techniques (Falk et al. [2006],
StatSoft Inc. [2010]). Once the errors are smoothed, the monotonic (increasing or decreasing)
trend component of the time series can be represented using linear or nonlinear functions, e.g., using
the logistic function.
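
Moving average and median smoothing can be compared on a short series containing one outlier. The
Python sketch below assumes a centered window of odd width n and simply drops the positions at
both ends where the window does not fit; the function names and the sample series are illustrative.

```python
from statistics import median

def moving_average(series, n, weights=None):
    """Simple or weighted average over a centered window of odd width n."""
    half = n // 2
    weights = weights or [1.0] * n
    total = sum(weights)
    return [sum(w * v for w, v in zip(weights, series[i - half:i + half + 1])) / total
            for i in range(half, len(series) - half)]

def moving_median(series, n):
    """Median over a centered window of odd width n (no weighting possible)."""
    half = n // 2
    return [median(series[i - half:i + half + 1]) for i in range(half, len(series) - half)]

# Hypothetical series with a single outlier (100).
series = [1, 2, 3, 100, 5, 6, 7]
print(moving_average(series, 3))
print(moving_median(series, 3))
```

The outlier inflates three consecutive moving averages, while the median-smoothed series is
barely affected, which matches the trade-off described above.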

Seasonality Analysis: The seasonality component of the time series data captures the cyclic
fluctuations in the data. In the time domain, seasonality can be measured by evaluating dependences
between elements of a time series separated by a distance, or lag, k. In time-domain analysis,
auto-correlation and auto-covariance are the most commonly used measures of dependence
between time series elements. High values of auto-correlation at lag positions that are multiples
of k expose a pattern that repeats after every k elements. Auto-correlation values for consecutive
lags are inter-dependent, i.e., they suffer from serial dependencies; if the first element is closely
related to the second, and the second to the third, then the first element is also related to the third
element. One way to examine the serial dependencies is to use the partial auto-correlation function,
which removes the influence of all elements within the lag while calculating auto-correlation values
(Box and Jenkins [1976]). The partial auto-correlation for a lag of 1 is equivalent to the
auto-correlation. The serial dependencies within a time series for a lag of value k can be eliminated
by differencing the series for the value k, i.e., replacing the i-th element with its difference from
the (i − k)-th element. This transformation can reveal hidden seasonal characteristics by eliminating
serial inter-dependencies. Secondly, the elimination of the serial dependencies can make the time
series stationary, i.e., it has constant mean, variance, and auto-correlation over time (StatSoft Inc.
[2010]).
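
As a rough illustration of auto-correlation and differencing, the sketch below builds a hypothetical series with a linear trend and a period-4 pattern. The helper names autocorrelation and difference and the data are assumptions for this example; the estimator used is the usual sample auto-correlation.

```python
from statistics import mean

def autocorrelation(series, k):
    """Sample auto-correlation of the series at lag k."""
    m = mean(series)
    denom = sum((x - m) ** 2 for x in series)
    num = sum((series[i] - m) * (series[i - k] - m) for i in range(k, len(series)))
    return num / denom

def difference(series, k):
    """Replace the i-th element with its difference from the (i - k)-th element."""
    return [series[i] - series[i - k] for i in range(k, len(series))]

# Hypothetical series: a pattern of period 4 on top of a linear trend.
pattern = [0.0, 5.0, 0.0, -5.0]
series = [0.5 * t + pattern[t % 4] for t in range(40)]

detrended = difference(series, 1)          # lag-1 differencing removes the trend
acf = {k: autocorrelation(detrended, k) for k in range(1, 9)}
seasonal_diff = difference(series, 4)      # lag-4 differencing removes the cycle
```

After lag-1 differencing, the auto-correlation is strongly positive at lag 4 and strongly negative at lag 2, exposing the period-4 cycle; differencing at lag 4 leaves only the constant contribution of the trend.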


Spectral Analysis: A time series can be viewed as a sum of a variety of cyclic components. These
cyclic components are characterized using their wavelengths, as expressed via periods and
frequencies. The frequency-domain (spectral) analysis of a time series aims to decompose the original
time series into its cyclic components and to compute their frequencies to study their impact on the
observed data. Spectral analysis uses two periodic sinusoidal functions, sine and cosine, to
represent the original time series. This representation can be cast as a linear multiple regression
problem, where the dependent variable is the observed time series, and the regression coefficients
express the degree to which the respective sine and cosine functions are correlated with the data.
An n-element time series will be represented by n/2 + 1 cosine functions and n/2 − 1 sine functions.
Thus, an n-element time series will have n sinusoidal waves. The spectral analysis will identify
correlations of sine and cosine functions of different frequencies with the observed data. If a
large coefficient is found, there is strong correlation of the observed data with the corresponding
frequency (i.e., an influential cycle with that frequency has been found).

Computationally, the spectral decomposition and identification of sine and cosine coefficients
can be done using Fourier transforms. For an n-element time series, a direct computation
involves O(n²) complex operations. In practice, this process is implemented using the Fast Fourier
Transform (FFT) algorithm, which requires O(n log n) operations.
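
To make the cost difference concrete, the following sketch contrasts a direct discrete Fourier transform with a textbook recursive radix-2 FFT and uses the result to find the dominant cycle. The function names and the synthetic 64-element series are assumptions for illustration; production code would normally rely on an optimized FFT library.

```python
import cmath
import math

def dft(x):
    """Direct discrete Fourier transform: O(n^2) complex operations."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
            for f in range(n)]

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT: O(n log n); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for f in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * f / n) * odd[f]
        out[f] = even[f] + twiddle
        out[f + n // 2] = even[f] - twiddle
    return out

# Hypothetical 64-element series: a strong cycle repeating 8 times, a weak one 3 times.
n = 64
series = [3.0 * math.cos(2 * math.pi * 8 * t / n) + 0.5 * math.sin(2 * math.pi * 3 * t / n)
          for t in range(n)]
spectrum = fft(series)
# Only frequencies 1 .. n/2 are distinct for a real-valued series.
dominant = max(range(1, n // 2 + 1), key=lambda f: abs(spectrum[f]))
```

The magnitude of each spectrum entry plays the role of the regression coefficient discussed above: the largest magnitude identifies the most influential cycle.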


Further Reading 
Basic idea: Box and Jenkins [1976], Croarkin and Tobias [2011], Falk et al. [2006],
StatSoft Inc. [2010]; Algorithms: ARIMA (Box and Jenkins [1976]), Exponential Smoothing
(Croarkin and Tobias [2011], StatSoft Inc. [2010]); Packages: R (The R Foundation),
Weka (Hall et al. [2009]), IBM SPSS (IBM SPSS [2010a]), RapidMiner (Rapid-i),
STATISTICA (The StatSoft Inc.), and SAS (SAS Institute Inc.); Applications: Nisbet et al.
[2009], Shmueli et al. [2010], Shumway and Stoffer [2010].


2.11 TEXT ANALYTICS 


Text analytics covers computational approaches that process structured and unstructured text data
to extract and present innate information. Text analytics approaches usually operate on a corpus
of text documents, potentially written in different languages, to transform the input data into
a form that can be consumed by a user, usually a human subject. The goals of text mining
methods are to derive new information from data, find patterns across datasets, and separate
relevant contextual information from noise. Text analytics is an inter-disciplinary field that uses
techniques from statistics, natural language processing, linguistics, artificial intelligence,
information retrieval, and data mining to pre-process, categorize, classify, and summarize the input
text data.

One encounters text analytics extensively in daily life: in web searches and navigation,
personalized online news articles, identification and filtering of e-mail spam, help desk
communications, finding relevant references during research work, and online advertisements.
Text analytics has been applied to a diverse class of domains, which include intelligence
gathering, bio-informatics, news gathering and classification, online advertising, medical informatics,
social sciences, marketing, patent searching, and web searching/navigation. Common operations
performed by text analytics tasks include pattern matching, lexical analysis, semantic analysis (e.g.,
synonym identification), entity recognition, co-reference, topic-based classification, correlation,
document clustering, link analysis, and tagging/annotation. These capabilities are provided
in most of the commercial text analytics packages (Davi et al. [2005], Feinerer et al. [2008], IBM
SPSS [2010c]).


Basic Idea: Text analytics applications usually take text in its native raw format as input. The
text can be organized either as a collection of documents or as individual text files. The input text
can have imperfections such as formatting, grammatical, and typographical errors, or contain
unimportant stop-words like "an" or "the". Thus, text analytics methods first pre-process the input
data to clean it and prepare it for further analysis. Common steps in text pre-processing include
(Feinerer et al. [2008]): import and parsing, stemming, whitespace elimination and case conversion,
stop-word removal, synonym identification, tagging, and annotations.
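
A few of these pre-processing steps can be sketched as follows. The stop-word list is a small illustrative subset and crude_stem is a deliberately naive stand-in for a real stemmer; both are assumptions made for this example rather than the behaviour of any cited package.

```python
import re

STOP_WORDS = {"a", "an", "the", "of", "and", "is", "are", "to", "in"}  # illustrative subset

def crude_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Case conversion, tokenization, stop-word removal, and stemming.

    Tokenizing on runs of letters also discards punctuation and extra whitespace.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

tokens = preprocess("The   analysts are  clustering the documents, and tagging them.")
```

The output token list is what later stages, such as the construction of a term-document matrix, would consume.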

After the pre-processing stage, data from the input corpus is usually represented using
specialized data structures. The most common text analytics data structure is the term-document
matrix. This approach represents the processed text as a bag of words in which the order of tokens is
irrelevant. The term-document matrix uses document IDs as rows and terms (tokens) as columns.
The matrix element (i, j) represents the weight of term j in document i. Common
weightings include term frequencies in a document, binary frequencies to represent inclusion or
exclusion of a term, and inverse document frequency weighting that gives more weight to less
frequent terms (more widely referred to as TF-IDF). The term-document matrix is usually sparse
and processed in a compressed format. Alternatively, the pre-processed text is maintained in its
native form and processed as strings (Lodhi et al. [2002]).
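
The sketch below builds a small sparse term-document matrix with TF-IDF weights and compares documents by cosine similarity. The three toy documents, the dictionary-of-dictionaries storage, and the particular formula tf × log(N / df) are assumptions for illustration; several TF-IDF variants are in common use.

```python
import math
from collections import Counter

docs = {  # hypothetical, already pre-processed documents
    "d1": ["stock", "market", "price", "stock"],
    "d2": ["market", "price", "option"],
    "d3": ["protein", "gene", "sequence"],
}

# Number of documents in which each term occurs.
doc_freq = Counter(t for tokens in docs.values() for t in set(tokens))

def tfidf_row(tokens):
    """Sparse row of the term-document matrix: {term: tf * idf}."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log(len(docs) / doc_freq[t]) for t in tf}

# Rows are document IDs; only non-zero columns (terms) are stored.
matrix = {doc_id: tfidf_row(tokens) for doc_id, tokens in docs.items()}

def cosine(u, v):
    """Cosine similarity between two sparse rows."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

sim_12 = cosine(matrix["d1"], matrix["d2"])
sim_13 = cosine(matrix["d1"], matrix["d3"])
```

Documents d1 and d2 share terms and therefore have a positive similarity, whereas d1 and d3 have no terms in common and a similarity of zero.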

We now provide an overview of typical applications of text mining, which include natural
language modeling, text categorization, text classification, semantic analysis, and sentiment and
topic analysis.


• Natural Language Modeling: The first step in many text analytics workloads often
involves understanding and processing input specified in a natural language such as English
or Spanish. The core step in understanding natural language text is to build a language
model (or an algorithm) that captures salient statistical characteristics of the distribution of
sequences of words (Bengio et al. [2014]). The most common approach for building a
language model uses a neural network to build a distributed representation of a word. The
distributed representation represents a word using a vector of features that are potentially not
mutually exclusive and that characterize the meaning of the word, where each vector entry
captures the contribution of a feature to the word's meaning. The key approaches for learning
word vectors include matrix-factorization based Latent Semantic Analysis (LSA) (Deerwester et al.
[1990]), local context window approaches such as the skip-gram model used in the word2vec
system (Mikolov et al. [2013]), and exploiting word-word co-occurrences as in the GloVe model
(Pennington et al. [2014]). The vector representations can then be used for follow-on analysis
such as similarity, clustering, etc.


• Text Clustering: Clustering allows (semi-)automatic categorization of text documents
according to a chosen similarity measure (Conrad et al. [2005], Zhao and Karypis [2005b]).
The sparse term-document representation of the text documents can be viewed as a
representation of the data corpus in a high-dimensional space. The text data can then be
clustered using traditional clustering algorithms like hierarchical clustering (Zhao and Karypis
[2005a]) and k-means clustering (Section 2.3). Common similarity measures used for text
clustering include metric distance, cosine distance, Pearson correlation, and extended Jaccard
similarities (Strehl et al. [2000]).


• Text Classification: In contrast to clustering, text classification organizes text documents
into pre-defined classes (Sebastiani [2002]). The class of a document is determined using
document attributes, e.g., words. One of the most popular methods of document classification
is the Naive Bayes classifier. The Naive Bayes classifier assumes that all attributes of
a training example are independent of each other given its class, and uses the training set to
estimate the parameters of the classification function. Alternative approaches include using
support vector machines (SVMs) to classify documents (Joachims [1998], Lodhi et al. [2002]).
SVMs are well suited for classifying data in a very high-dimensional feature space; e.g., when
substrings are used as features, SVMs with string kernel functions can be used to classify
documents such that documents with more substrings in common are assigned to the same class.


• Semantic Analysis: Given a corpus of text documents, semantic analysis aims to extract
and represent the innate semantic meaning, as approximated via the contextual usage of words,
using either statistical or matrix computations. The most common technique used for
semantic analysis, latent semantic analysis/indexing (LSA), uses a matrix representation of the
underlying text documents to analyze relationships between documents and the words they
contain, and to infer deeper relationships between words, between words and passages, etc.
(Landauer and Dumais [2008]). LSA creates a low-rank approximation of the term-document
matrix using the Singular Value Decomposition (SVD), in which the k largest singular values
are retained. The rank-reduced approximation suppresses noise in the original
data and merges dimensions associated with words of similar meaning. The generated
reduced-dimensional matrices can then be used for determining various similarities such as
word-word or word-passage similarities.


• Sentiment and Topic Analysis: Sentiment analysis (opinion mining) aims to discover the
tone of a document or sentence (e.g., positive, negative, or neutral) by applying natural
language processing to the text data (Lakkaraju et al. [2011], Pang et al. [2002]). Sentiment
analysis is related to topic analysis, which aims to identify topics from the set of input
text documents. Topic analysis uses various transformations of the term-document matrix,
in particular, the non-negative matrix factorization (NNMF) (Lee and Seung [1999]), to deduce
important topics. The non-negative matrix factorization is a family of unsupervised learning
algorithms that view an object using parts-based additive representations. For topic
analysis, NNMF can be used to explain a document as a linear combination of topics with
non-negative weights, with higher weights reflecting those topics that strongly influence the
document.
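
As one concrete instance of the applications listed above, the following sketch implements a small multinomial Naive Bayes text classifier. The add-one (Laplace) smoothing, the function names, and the toy spam/ham training set are assumptions made for this illustration and are not taken from the cited references.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labelled_docs):
    """Estimate class counts and per-class word counts from (tokens, label) pairs."""
    class_counts = Counter(label for _, label in labelled_docs)
    word_counts = defaultdict(Counter)
    for tokens, label in labelled_docs:
        word_counts[label].update(tokens)
    vocab = {t for tokens, _ in labelled_docs for t in tokens}
    return class_counts, word_counts, vocab

def classify(tokens, model):
    """Return the class with the highest log-posterior under the independence assumption."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for t in tokens:
            # Add-one (Laplace) smoothing avoids zero probabilities for unseen words.
            score += math.log((word_counts[label][t] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [  # hypothetical pre-processed training documents
    (["cheap", "pills", "offer"], "spam"),
    (["limited", "offer", "cheap"], "spam"),
    (["meeting", "agenda", "notes"], "ham"),
    (["project", "meeting", "schedule"], "ham"),
]
model = train_naive_bayes(training)
label_a = classify(["cheap", "offer", "today"], model)
label_b = classify(["meeting", "notes"], model)
```

Because word probabilities are multiplied independently (added in log space), training reduces to counting words per class, which is what makes the method attractive for large corpora.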


Further Reading 

Basic idea: Feinerer et al. [2008], Manning et al. [2009]; Algorithms: Naive Bayes (Manning
et al. [2009], McCallum and Nigam [1998]), Latent Semantic Analysis (Deerwester
et al. [1990], Landauer and Dumais [2008], Landauer et al. [1998]), String Kernel Functions
(Lodhi et al. [2002]), and Non-negative Matrix Factorization (Ding et al. [2006], Ho [2008],
Kanjani [2007], Lee and Seung [1999, 2001], Xu et al. [2003]); Packages: R (Feinerer et al.
[2008]), SPSS Modeler (IBM SPSS [2010c]), STATISTICA (The StatSoft Inc.), RapidMiner
(Rapid-i), Oracle Data Miner (Oracle Corp. [b]), and SAS (Davi et al. [2005]);
Applications: Conrad et al. [2005], Gartner [2003], Joachims [1998,?], Lodhi et al. [2002],
Pang et al. [2002], Sahami et al. [1998], Zhao and Karypis [2005b].


2.12 MONTE CARLO METHODS


The Monte Carlo method refers to a class of algorithms that employ repeated statistical sampling
to compute approximate solutions to quantitative problems. These techniques are widely used for
applications with inherent uncertainty, such as pricing of various financial instruments, or for
simulating systems with many coupled degrees of freedom, e.g., simulating behaviors of different
materials. While Monte Carlo methods were originally designed for solving problems
with probabilistic outcomes, they have also been applied to deterministic problems with
infeasible computational requirements, e.g., evaluating multi-dimensional definite integrals with a
large number of dimensions or with difficult boundary conditions.

The first application of Monte Carlo methods was to understand the behavior of nuclear
reactions (e.g., neutron travel patterns) (Metropolis and Ulam [1949]). Over the years, the Monte
Carlo methodology has been applied to a wide array of application domains including
different physical sciences, engineering, finance, numerical analysis, and mathematical optimization
(Fishman [1996]). Perhaps the most popular application of Monte Carlo methods is in financial
engineering, where they are extensively used for insurance risk modeling, pricing various types
of options and derivatives (e.g., European and American style options, mortgage-backed
securities), and portfolio analysis (e.g., calculating Value-at-Risk (VaR)) (Boyle [1977], Glasserman
[2003]).


Basic Idea: The Monte Carlo method is defined as an approach that represents an approximate
solution of a problem as a parameter of a hypothetical population. It uses a random sequence of
values to construct a sample of the population, from which statistical estimates of the parameter
can be obtained (Halton [1970]). For a given problem, the Monte Carlo method repeatedly generates
independent, identically distributed random variables from the distribution that models the problem,
and then uses them in either a deterministic or a stochastic model to compute the solution. In its
simplest formulation, the Monte Carlo method can be viewed as an approach that uses statistical
sampling to estimate a numerical integral. The standard error of a Monte Carlo estimate decreases
in proportion to the inverse square root of the sample size. Moreover, this convergence rate is
independent of the dimensionality of the integral. Unlike conventional numerical integration
approaches, which suffer from the curse of dimensionality, the amount of work required by the Monte
Carlo approach does not increase exponentially in the number of dimensions. Unfortunately, to
improve the estimation quality, the Monte Carlo approach usually requires a large number of samples:
halving the standard error requires four times as many samples. One way to improve the efficacy of
Monte Carlo methods is to use one of the variance reduction methods, e.g., importance or
stratified sampling.
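
The following sketch estimates a five-dimensional integral by plain Monte Carlo sampling and reports the standard error, to illustrate the inverse-square-root behaviour described above. The integrand, the function name mc_integrate, the seed, and the sample sizes are assumptions chosen for the example.

```python
import math
import random

def mc_integrate(f, dim, samples, rng):
    """Estimate the integral of f over the unit hypercube [0, 1]^dim.

    Returns (estimate, standard_error).
    """
    total = total_sq = 0.0
    for _ in range(samples):
        y = f([rng.random() for _ in range(dim)])
        total += y
        total_sq += y * y
    mean = total / samples
    variance = max(total_sq / samples - mean * mean, 0.0)
    return mean, math.sqrt(variance / samples)

def f(x):
    # Integrand with a known integral over [0, 1]^5: each squared coordinate integrates to 1/3.
    return sum(xi * xi for xi in x)

rng = random.Random(42)
small = mc_integrate(f, dim=5, samples=1_000, rng=rng)
large = mc_integrate(f, dim=5, samples=100_000, rng=rng)
```

Increasing the sample count by a factor of 100 should reduce the standard error by roughly a factor of 10, and the same code works unchanged for any number of dimensions.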

Sawilowsky [2003] classifies applications of Monte Carlo methods into three types: (1) using
stochastic techniques for solving deterministic problems, e.g., Monte Carlo integration for
solving multi-dimensional integral problems; (2) using stochastic techniques for solving
problems with probabilistic outcomes (e.g., pricing of financial instruments), usually referred to as
Monte Carlo simulation; and (3) using the Monte Carlo method as a tool for generating samples
of a particular probability distribution. However, in many practical scenarios, this distinction
gets blurred. Algorithms that employ Monte Carlo methods use the following methodology.


1. Identify a probability distribution function that mimics the problem under consideration 
(e.g., normal distribution for option pricing). 


2. Generate samples from the probability distribution function using a pseudo-random num- 
ber generator. 


3. Pass the sample values through a deterministic or stochastic model (for simulation and
sampling) to get the final result.
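
The three steps above can be sketched for the option-pricing example. The sketch assumes a European call whose underlying follows geometric Brownian motion (the Black-Scholes setting); the parameter values, function names, and the closed-form reference used as a sanity check are assumptions for illustration and are not prescribed by the text.

```python
import math
import random

def european_call_mc(s0, strike, rate, sigma, maturity, samples, rng):
    """Monte Carlo price of a European call under geometric Brownian motion."""
    drift = (rate - 0.5 * sigma ** 2) * maturity
    vol = sigma * math.sqrt(maturity)
    payoff_sum = 0.0
    for _ in range(samples):
        z = rng.gauss(0.0, 1.0)                 # step 2: sample the normal distribution
        s_t = s0 * math.exp(drift + vol * z)    # step 3: push the sample through the model
        payoff_sum += max(s_t - strike, 0.0)
    return math.exp(-rate * maturity) * payoff_sum / samples

def black_scholes_call(s0, strike, rate, sigma, maturity):
    """Closed-form reference value, used only to sanity-check the simulation."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    d1 = (math.log(s0 / strike) + (rate + 0.5 * sigma ** 2) * maturity) / (sigma * math.sqrt(maturity))
    d2 = d1 - sigma * math.sqrt(maturity)
    return s0 * cdf(d1) - strike * math.exp(-rate * maturity) * cdf(d2)

rng = random.Random(7)
estimate = european_call_mc(100.0, 105.0, 0.02, 0.25, 1.0, 200_000, rng)
reference = black_scholes_call(100.0, 105.0, 0.02, 0.25, 1.0)
```

For this simple payoff a closed form exists, so simulation is unnecessary; the same three-step structure is what carries over to instruments for which no closed form is available.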


Irrespective of how Monte Carlo methods are used in practice, all Monte Carlo implementations
rely on good random number generators. In practice, it is not
possible to generate truly random numbers, so Monte Carlo algorithms use either pseudo-random
(PRNG) or quasi-random (QRNG) number generators. Most PRNG algorithms use bit manipulation
and shuffling (e.g., multiply-with-carry, xorshift, subtract-with-borrow) combined
with recurrence strategies to incorporate pseudo-randomness into the generated sequences. Key
PRNG algorithms include the Mersenne Twister (MT) generator (Matsumoto and Nishimura
[1998]) and the Multiply-with-Carry algorithm (Marsaglia and Zaman [1991]). The most widely
used quasi-random number generator is the Sobol sequence generator, which generates uniformly
distributed numbers in the specified set of dimensions.
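
As an illustration of the bit-manipulation style of PRNG mentioned above, the sketch below implements a 32-bit xorshift generator with the commonly used (13, 17, 5) shift triple. The class name, the seed, and the conversion to a float in [0, 1) are assumptions for this example; such a small generator is shown for clarity and is not a recommendation for production simulations.

```python
class XorShift32:
    """32-bit xorshift generator with the (13, 17, 5) shift triple."""

    def __init__(self, seed):
        if seed % 2**32 == 0:
            raise ValueError("seed must be non-zero modulo 2**32")
        self.state = seed % 2**32

    def next_u32(self):
        """Advance the recurrence and return the next 32-bit unsigned integer."""
        x = self.state
        x ^= (x << 13) & 0xFFFFFFFF   # mask keeps the value within 32 bits
        x ^= x >> 17
        x ^= (x << 5) & 0xFFFFFFFF
        self.state = x
        return x

    def random(self):
        """Uniform float in [0, 1)."""
        return self.next_u32() / 2**32

gen = XorShift32(seed=2463534242)
draws = [gen.random() for _ in range(10_000)]
sample_mean = sum(draws) / len(draws)
```

The sequence is fully determined by the seed, which is what makes pseudo-random simulations reproducible.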

In practice, different variants of the original Monte Carlo algorithm are also used. One
popular variant, the Markov Chain Monte Carlo (MCMC) method (Metropolis et al. [1953]), is used
for generating samples from a probability distribution by using a Markov chain whose stationary
distribution is the desired distribution. The MCMC algorithm is used for solving multi-dimensional
integral problems and has also been applied to scenarios that exhibit random walk behavior, e.g.,
in statistical physics or graphics applications. Another version, the Quasi-Monte Carlo method,
uses low-discrepancy samples that are deterministically chosen based on equi-distributed
sequences, i.e., sequences that appear to fill a region of n-dimensional space evenly. The
low-discrepancy samples often lead to faster solution time and/or higher accuracy. Quasi-Monte Carlo
methods are used for solving multi-dimensional integration problems, e.g., pricing financial
derivatives like Collateralized Mortgage Obligations (CMOs) (Peskov and Traub [1995]).
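
A minimal random-walk Metropolis sampler gives a feel for how MCMC draws samples from a target distribution. The one-dimensional normal target, the proposal width, the burn-in length, and the function names are assumptions chosen for this sketch.

```python
import math
import random

def metropolis(log_density, x0, steps, step_size, rng):
    """Random-walk Metropolis sampler for a one-dimensional target density."""
    x, log_p = x0, log_density(x0)
    chain = []
    for _ in range(steps):
        proposal = x + rng.gauss(0.0, step_size)
        log_p_new = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(x)).
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            x, log_p = proposal, log_p_new
        chain.append(x)
    return chain

def log_target(x):
    # Unnormalized log-density of a normal distribution with mean 3 and standard deviation 2.
    return -0.5 * ((x - 3.0) / 2.0) ** 2

rng = random.Random(123)
chain = metropolis(log_target, x0=0.0, steps=50_000, step_size=2.0, rng=rng)
kept = chain[5_000:]                      # discard burn-in samples
est_mean = sum(kept) / len(kept)
est_var = sum((x - est_mean) ** 2 for x in kept) / len(kept)
```

Only ratios of the target density are needed, so the normalizing constant never has to be computed; the sample mean and variance of the chain approach those of the target distribution.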


Further Reading 

Basic idea: Halton [1970], Metropolis et al. [1953], Metropolis and Ulam [1949],
Sawilowsky [2003]; Algorithms: Couture and L'Ecuyer [1994], Marsaglia and Zaman [1991],
Matsumoto and Nishimura [1998, 2000], Saito and Matsumoto [2008]; Applications: Boyle
[1977], Fishman [1996], Glasserman [2003], Halton [1970], Metropolis and Ulam [1949],
Peskov and Traub [1995].


2.13 MATHEMATICAL PROGRAMMING


In a mathematical optimization or programming problem, one seeks to find an optimal solution
for a problem scenario as defined by its constraints using a mathematical formulation.
Specifically, the solution of a mathematical programming problem aims to minimize or maximize a real
objective function of real or integer variables, subject to constraints on the variables (Greenberg
[2010], Holder [2006-2008]). The mathematical programming approach is usually applied to
cases where a closed-form solution is not (easily) found and one has to settle for the best available
solution. It forms the cornerstone of methods used in operations research and other related
disciplines like industrial engineering, social sciences, economics, and management sciences. It has
also been applied in a wide variety of domains including scheduling problems (e.g., transportation),
manufacturing (e.g., steel production (Dutta and Fourer [Fall 2001])), supply-chain management,
product portfolio optimizations, workforce management, and product configuration selection.


Basic Idea: A mathematical program is an optimization problem of the form

\[ \text{Maximize } f(x) : x \in X,\; g(x) \le 0,\; h(x) = 0, \tag{2.10} \]

where $X$ is a subset of $\mathbb{R}^n$ and is in the domain of the functions $f$, $g$, and $h$, which
map into real spaces. The relations $x \in X$, $g(x) \le 0$, and $h(x) = 0$ are called constraints, and
the function $f$ is called the objective or cost function (Holder [2006-2008]). The domain $X$ of the
objective function is called the search space. A point $x$ is feasible if $x \in X$ and it satisfies the
constraints $g(x) \le 0$ and $h(x) = 0$. A point $x^*$ is optimal if it is feasible and if the value of the
objective function is not less than that of any other feasible solution: $f(x^*) \ge f(x)$ for all
feasible $x$ (also called candidate or feasible solutions). This description uses maximization as the
sense of optimization. The problem could easily be restated as a minimization problem by
appropriately changing the meaning of the optimal solution: $f(x^*) \le f(x)$ for all feasible $x$.

Mathematical programming broadly covers approaches used for solving and using
mathematical programs. It includes theorems that govern the form of the solutions, algorithms
to seek a solution or ascertain that none exists, formulations of problems as mathematical programs,
theorems about the quality of results, etc. In practice, mathematical programming approaches
are classified according to the properties of the objective function, constraints, and candidate
solutions. We now discuss important classes of mathematical programming:


Linear Programming: Linear programming is a special case of convex programming in
which the objective function $f$ is linear (and therefore convex) and the associated set of constraints
is specified using only linear equalities or inequalities. Linear programming is used for
modeling problems from operations research such as network flow, production planning, financial
management, etc. Canonically, linear programs can be expressed in matrix form as

\[ \text{Maximize } c^T x \quad \text{subject to } Ax \le b,\; x \ge 0, \tag{2.11} \]
where $x$ represents the vector of variables to be determined, $c$ and $b$ are vectors of known
coefficients, and $A$ is a known matrix of coefficients. The set of constraints, $Ax \le b$, forms a convex
polytope, and any linear programming method would traverse over its vertices to find a point where
the function $c^T x$ has the maximum (or minimum) value, if such a point exists. Most approaches
for solving linear programming problems explore the feasible region over the convex polytope
defined by the linear constraints of the problem. The simplex method (Dantzig [1963]) is one of the
earliest, but still widely used, methods for solving LP problems. The simplex method constructs
a feasible solution at a vertex of the polytope and then tests adjacent vertices by traversing a path
on the edges of the polytope such that the objective function is improved or is unchanged. The
simplex algorithm is very efficient in practice, requiring 2n to 3n iterations, where n is the
number of equality constraints, and is known to run in polynomial time on certain random inputs.
However, the worst-case complexity of the simplex algorithm is exponential in the problem size. This
problem was addressed by the Ellipsoid method (Schrijver [1998], Todd [2002]), which uses an
iterative approach to generate a sequence of ellipsoids whose volumes uniformly reduce at every step,
thus enclosing a minimizer of the convex objective function. The Ellipsoid method was the first
solution to linear programming problems that ran in worst-case polynomial time. Karmarkar's
algorithm (Karmarkar [1984], Todd [2002]), which uses an interior-point projection method,
improves on the Ellipsoid method's complexity bounds and runs in polynomial time for both average
and worst cases.
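
The fact that an optimum of a bounded, feasible linear program is attained at a vertex of the polytope can be checked on a tiny two-variable instance. The sketch below is a brute-force vertex enumeration, not the simplex method: it intersects every pair of constraint boundaries and keeps the best feasible intersection. The instance and the function name are assumptions for illustration.

```python
from itertools import combinations

def solve_lp_by_vertices(c, A, b, tol=1e-9):
    """Maximize c.x subject to A x <= b for two variables by enumerating vertices.

    Non-negativity (x >= 0) must be included as rows of A. Assumes the feasible
    region is a non-empty, bounded polygon.
    """
    best_x, best_val = None, float("-inf")
    for (a1, b1), (a2, b2) in combinations(list(zip(A, b)), 2):
        det = a1[0] * a2[1] - a1[1] * a2[0]
        if abs(det) < tol:
            continue                                  # parallel constraint boundaries
        x = ((b1 * a2[1] - a1[1] * b2) / det,         # Cramer's rule
             (a1[0] * b2 - b1 * a2[0]) / det)
        feasible = all(row[0] * x[0] + row[1] * x[1] <= rhs + tol for row, rhs in zip(A, b))
        value = c[0] * x[0] + c[1] * x[1]
        if feasible and value > best_val:
            best_x, best_val = x, value
    return best_x, best_val

# Hypothetical production-planning instance:
# maximize 3x + 5y  subject to  x <= 4,  2y <= 12,  3x + 2y <= 18,  x >= 0,  y >= 0
A = [(1, 0), (0, 2), (3, 2), (-1, 0), (0, -1)]
b = [4, 12, 18, 0, 0]
best_x, best_val = solve_lp_by_vertices((3, 5), A, b)
```

Enumeration examines every pair of constraints, which grows combinatorially with problem size; the simplex method avoids this by walking only along improving edges.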


Integer Programming: The integer programming (IP) family covers the set of linear programming
applications that require the values of the unknown variables to be integers. If only some of the
variables are required to be integers, the problems are called mixed-integer programming (MIP)
problems. Another version of the integer programming problem, 0-1 or binary integer programming,
requires the values of the unknown variables to be either 0 or 1. Integer programming problems are
often observed in scheduling scenarios, e.g., house building with workers with specific skills (IBM
Corp. [b]), and in assignment-related problems, e.g., airline fleet assignment for optimal
utilization or profit maximization (Abara [1989]). Integer programming is also used in a variety of
distribution or network flow optimization problems.

Most variants of IP problems are NP-hard in general and are usually solved using two
classes of techniques: cutting planes and branch-and-bound.
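
Branch-and-bound can be illustrated on the 0-1 knapsack problem, a simple binary integer program. In this sketch the upper bound at each node comes from the fractional (linear programming) relaxation; the instance, the function name, and the depth-first branching order are assumptions chosen for the example, and item weights are assumed to be positive.

```python
def knapsack_branch_and_bound(values, weights, capacity):
    """Solve a 0-1 knapsack problem exactly with depth-first branch-and-bound."""
    # Sort items by value density so the fractional relaxation bound is easy to compute.
    items = sorted(zip(values, weights), key=lambda vw: vw[0] / vw[1], reverse=True)
    best = 0

    def bound(i, value, room):
        """Upper bound: fill the remaining room greedily, allowing one fractional item."""
        for v, w in items[i:]:
            if w <= room:
                value, room = value + v, room - w
            else:
                return value + v * room / w
        return value

    def branch(i, value, room):
        nonlocal best
        best = max(best, value)
        if i == len(items) or bound(i, value, room) <= best:
            return                                   # prune: cannot beat the incumbent
        v, w = items[i]
        if w <= room:
            branch(i + 1, value + v, room - w)       # branch with x_i = 1
        branch(i + 1, value, room)                   # branch with x_i = 0

    branch(0, 0, capacity)
    return best

best_value = knapsack_branch_and_bound([60, 100, 120], [10, 20, 30], 50)
```

Pruning with the relaxation bound lets the search skip subtrees that cannot improve on the best solution found so far, although the worst case remains exponential.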


Combinatorial Programming: Combinatorial programming (or optimization) covers methods
that aim to optimize a cost function based on the selection of objects from a set of objects
(Papadimitriou and Steiglitz [1998]). Let $N = \{1, \ldots, n\}$ be a set of objects and let
$\{S_1, S_2, \ldots, S_n\}$ be a finite collection of subsets. These subsets are characterized by inclusion
and exclusion of objects based on certain conditions. Let each subset, $S_k$, be associated with a cost
function, $f(S_k)$. The combinatorial optimization problem aims to select the subset of objects so as to
maximize (minimize) the cost function. This formulation can be viewed as a special case of integer
programming, whose decision variables are binary valued: $x(i,k) = 1$ if the $i$-th element is in the
$k$-th set, $S_k$; otherwise, $x(i,k) = 0$ (Holder [2006-2008]). In practice, combinatorial optimization
techniques are applied to a wide array of problems such as optimizing vehicle routing, VLSI
circuit design, oil/gas pipeline design, steel/paper manufacturing, and matching factories with
markets via intermediate warehouses (i.e., the transshipment problem).

Unlike linear programming, the feasibility space of combinatorial problems is not convex;
one needs to determine a global optimal point from several possible local optimal solutions.
Although most combinatorial programming problems are either NP-Hard or NP-Complete, in
practice, many of these problems can be solved in reasonable time by either choosing alternative
formulations or exploiting specific features of a problem to compute its exact results (Hoffman
[2000]). While special cases of some problems can be solved in polynomial time, the most common
approaches for solving combinatorial problems use either approximation algorithms that find, in
polynomial time, a solution that is provably close to the optimal, or heuristics that search the
feasibility space to compute sub-optimal solutions.


Constraint Optimizations: Constraint optimization (satisfaction) is a family of optimization
problems that have a constant objective function with a set of constraints that impose conditions
on the solution variables. Constraint satisfaction problems (CSPs) are characterized by
constraints that are defined over a finite domain (Apt [2003], Dechter [2003]). In practice, CSPs have
been used for scheduling problems, circuit layout in VLSI chips, DNA sequencing, production
planning, computer vision, and computer games (e.g., sudoku) (Cork Constraint Computation
Centre, University College Cork [2011], Moore [2011b]). A solution to a constraint satisfaction
problem is an assignment of values to the variables that satisfies all of the specified constraints.

The most common approach to solving a CSP is to search through the different possible
alternative solutions. Unlike the generalized search algorithms used in AI, CSP search
algorithms can use heuristics that exploit the problem structure and can also perform the search in
any order. Most CSP algorithms use depth-first search to traverse the search tree. Often, the
traversals reach a node where a variable cannot be assigned because its domain is empty. In such
cases, the search algorithms backtrack to the previous assignment (i.e., a node at a higher level in
the search) and restart the search with a new assignment. One of the most widely used CSP search
algorithms is the A* algorithm (Moore [2011a], Nilsson [1980]). The A* algorithm uses the best-first
strategy to choose the next search node to traverse.
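
Depth-first search with backtracking can be sketched on a small map-colouring CSP, in which neighbouring regions must receive different colours. The instance, the fixed variable ordering, and the function name are assumptions for illustration; practical solvers add variable-ordering heuristics and constraint propagation.

```python
def backtrack(assignment, variables, domains, neighbours):
    """Depth-first search with backtracking for a binary 'not-equal' CSP."""
    if len(assignment) == len(variables):
        return assignment
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        # A value is consistent if no already-assigned neighbour uses it.
        if all(assignment.get(n) != value for n in neighbours[var]):
            assignment[var] = value
            result = backtrack(assignment, variables, domains, neighbours)
            if result is not None:
                return result
            del assignment[var]          # undo and try the next value
    return None                          # no value left: backtrack to the caller

# Hypothetical map-colouring instance (a subset of Australian states).
neighbours = {
    "WA": ["NT", "SA"],
    "NT": ["WA", "SA", "Q"],
    "SA": ["WA", "NT", "Q", "NSW", "V"],
    "Q": ["NT", "SA", "NSW"],
    "NSW": ["Q", "SA", "V"],
    "V": ["SA", "NSW"],
}
variables = list(neighbours)
domains = {v: ["red", "green", "blue"] for v in variables}
solution = backtrack({}, variables, domains, neighbours)
```

With three colours a consistent assignment exists; with only two colours every branch fails and the function returns None after exhausting the search tree.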


Nonlinear Programming: Nonlinear programming (NLP) can be viewed as a generalization
of different mathematical programming formulations. In an NLP formulation, either or both the
objective function and the constraints can be nonlinear. An NLP problem has the following form:
minimize $f(x)$, subject to $g_i(x) = 0$ for $i = 1, \ldots, m_1$ and $h_j(x) \ge 0$ for
$j = m_1 + 1, \ldots, m$, where $m \ge m_1 \ge 0$. Depending on the nonlinearity, multiple special cases
arise. For example, a scenario where the objective function is nonlinear, but the constraint functions
$g$ and $h$ are linear, is called a linearly constrained optimization. If the objective function and the
constraints are linear, an NLP problem reduces to a linear programming problem. When the objective
function is quadratic and the constraints are linear, the problem is termed quadratic programming. If
the objective function is convex and the constraints define a convex set, the problem is called convex
optimization (Boyd and Vandenberghe [2009]). When both the objective function and the constraints
are nonlinear, the problem is a general nonlinearly constrained optimization; finally, when there are
no constraints at all, the problem is called unconstrained optimization (Network-Enabled Optimization
Systems Wiki).

The difficulty in solving an NLP problem arises from the fact that NLP problems can have a
non-convex objective function or constraints. Such problems can have multiple feasible regions and
multiple locally optimal points within each region. Consequently, non-convex NLP problems
can exhibit solutions with local optima: they are spurious solutions that satisfy the requirements on
the derivatives of the constraint functions. Determining whether an NLP problem is infeasible, whether
the objective function is unbounded, or whether a solution is the global optimum can require time
exponential in the number of variables and constraints. The complexity of solving an NLP problem
varies according to its type: in general, problems with a convex objective function and constraints
(e.g., convex quadratic programming) are the easiest to solve, while problems that aim to find the
global optimum of a non-convex function are much harder to solve.
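
The effect of local optima can be seen with plain gradient descent on a one-dimensional non-convex function. The quartic objective, the step size, the iteration count, and the two starting points are assumptions chosen for this sketch; restarting from several points is a common heuristic, not a guarantee of finding the global optimum.

```python
def f(x):
    """Non-convex objective with two local minima."""
    return x ** 4 - 3 * x ** 2 + x

def grad_f(x):
    """Derivative of f."""
    return 4 * x ** 3 - 6 * x + 1

def gradient_descent(x0, step=0.01, iterations=2000):
    """Follow the negative gradient from x0 with a fixed step size."""
    x = x0
    for _ in range(iterations):
        x -= step * grad_f(x)
    return x

starts = [-2.0, 2.0]
minima = [gradient_descent(x0) for x0 in starts]   # one local minimum per start
best = min(minima, key=f)                          # keep the better of the two
```

Started from the right, the descent settles in a local minimum near x = 1.13, while the start on the left reaches the lower minimum near x = -1.30; a single run cannot tell which of the two is global.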


Further Reading 

Basic idea: Greenberg [2010], Holder [2006-2008], Network-Enabled Optimization Systems
Wiki; Algorithms: Linear Programming (Dantzig [1963], Gill et al. [1986], Marsten
et al. [1990], Mehrotra [1992,?], Schrijver [1998], Todd [2002,2], Weisstein [2011]), Integer
Programming (Barnhart et al. [1998]), Combinatorial Programming (Hoffman [2000],
Kirkpatrick et al. [1983], Papadimitriou and Steiglitz [1998], Schrijver [2002], Spall [2003],
Vazirani [2003]), Constraint Optimizations (Apt [2003], Dechter [2003], Moore [2011a],
Nilsson [1980]), Nonlinear Programming (Boyd and Vandenberghe [2009], Gould and
Toint, Network-Enabled Optimization Systems Wiki, Neumaier [2004]); Packages: IBM
ILOG CPLEX (IBM Corp. [b]), COIN-OR (COIN-OR Foundation [2011]), Gurobi (Gurobi
Optimization Inc.); Applications: Abara [1989], Anderson et al. [2009], Armacost et al.
[2004], Cork Constraint Computation Centre, University College Cork [2011], Dutta and
Fourer [Fall 2001], IBM Corp. [b], Moore [2011b], Papadimitriou and Steiglitz [1998],
Schrijver [2002].





2.14 ON-LINE ANALYTICAL PROCESSING 


On-line analytical processing, or OLAP, refers to a broad class of analytics techniques that
process historical data using a logical multi-dimensional data model (Chaudhuri and Dayal [1997],
Codd et al. [1993a,b]). Over the years, OLAP has emerged as the key business intelligence
(BI) technology for solving decision support problems like business reporting, financial planning
and budgeting/forecasting, trend analysis, and resource management. OLAP technologies usually
operate on data warehouses, which are subject-oriented, integrated, time-varying, non-volatile,
historical collections of data (Chaudhuri and Dayal [1997]). Unlike on-line transaction
processing (OLTP) applications that support repetitive, short, atomic transactions, OLAP
applications are targeted at processing complex and ad-hoc queries over very large (multi-terabyte
and more) historical data stored in the data warehouses.


Basic Idea: OLAP applications are targeted at knowledge workers (e.g., analysts, managers)
who want to extract useful information from a set of large, disparate data sources stored in the
data warehouses. These sources can be semantically or structurally different and can contain historical
data consolidated over long time periods. OLAP workloads involve queries that explore
relationships within the underlying data and then exploit the acquired knowledge for different
decision support activities such as post-mortem analysis/reporting, prediction, and forecasting.
OLAP queries tend to invoke complex operations (e.g., aggregations, grouping) over a large
number of data items or records. Thus, unlike OLTP workloads, where transaction throughput
is important, query throughput and response times are more relevant for OLAP workloads.
An OLAP system therefore needs to support a logical model that can represent relationships
between records succinctly, a query system that can explore and exploit these relationships, and
an implementation that can provide scalable performance.


Logical Data Model: Most OLAP systems are based on a logical data model that views data in
the warehouse as multi-dimensional data cubes. The multi-dimensional data model grew out of
the two-dimensional array-based data representation popularized by the spreadsheet applications
used by business analysts (Chaudhuri and Dayal [1997], Gray et al. [1997], Harinarayan et al.
[1996]). The data cube is typically organized around a central theme, e.g., car sales. This theme is
usually captured using one or more numeric measures or facts that are the objects of analysis (e.g.,
number of cars sold and the sales amount in dollars). Other examples of numerical measures
include budget, revenue, and retail inventory. The measures are associated with a set of independent
dimensions that provide the context. For example, the dimensions associated with the car sales
measure can include the car brand, model and type, various car attributes (e.g., color), geography,
and time. Each measure value is associated with a unique combination of the dimension values.
Thus, a measure value can be viewed as an entry in a cell of a multi-dimensional cube with a
specified number of dimensions.
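The cell view of a cube can be sketched in a few lines of Python. This is only an illustration, assuming a four-dimensional car-sales cube; the dimension names, brand labels, and figures are invented for the example, and the helper names `cell` and `total` are hypothetical.

```python
# Hypothetical car-sales cube: each cell is addressed by a unique
# combination of dimension values and holds the measures for that cell.
DIMENSIONS = ("brand", "color", "state", "year")

cube = {
    ("BrandA", "red", "NY", 2014): {"units": 120, "sales_usd": 2_400_000},
    ("BrandA", "blue", "NY", 2014): {"units": 80, "sales_usd": 1_650_000},
    ("BrandB", "red", "CA", 2014): {"units": 200, "sales_usd": 3_900_000},
    ("BrandB", "red", "CA", 2015): {"units": 210, "sales_usd": 4_150_000},
}


def cell(cube, **coords):
    """Return the measures stored at one fully specified cell, or None if empty."""
    key = tuple(coords[d] for d in DIMENSIONS)
    return cube.get(key)


def total(cube, measure):
    """Sum one measure over every populated cell of the cube."""
    return sum(measures[measure] for measures in cube.values())
```

Only populated cells are stored here, so an empty combination such as a blue BrandB car simply has no entry.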

In the multi-dimensional OLAP model, each dimension can be further characterized using
a set of attributes, e.g., the geography dimension can consist of country, region, state, and city. The
attributes can be viewed as sub-dimensions and can themselves be related in a hierarchical manner.
The attribute hierarchy is a series of parent-child relationships that is specified by the order of
attributes, e.g., year, month, week, and date. A dimension can be associated with more than one
hierarchy. The parent-child relationship represents the order of summarization via aggregation:
the measure values associated with a parent are computed by aggregating the measures of its children.
Thus, the dimensions, along with their hierarchical attributes and the corresponding measures,
can be used to capture the relationships in the data.


OLAP Queries: Typical OLAP analytics queries perform two main functions: reporting and presentation.
Reporting involves organizing dimensions and performing computations on the corresponding
measures. Presentation involves selecting dimensions and measures from the original or
computed versions of the data, and preparing them for display. Functionally, OLAP queries can
be classified into what-now (post-mortem analysis), what-if (prediction), and what-next (forecasting).
To support these analyses, an OLAP engine provides a number of operators. Key operators
include: group-by, which collates the measures by the unique values of the
specified dimensions; slice_and_dice, which reduces the dimensionality of the dataset by
taking a projection of the data on a subset of dimensions for selected values of the other dimensions;
pivoting (or rotating), which re-orients the original cube to visualize the data using new
relationships; and rollup and drill-down, which move up (aggregating) and down (disaggregating)
the hierarchies within one or more dimensions. OLAP analysis also involves invoking a variety of
analytical functions on measure values. The OLAP analytical functions can be broadly classified into aggregation
(e.g., sum), scalar, and set functions (e.g., sort). These functions can be either user-defined or
pre-defined by the underlying system.
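The operators described above can be sketched over a small list of fact records. This is a minimal illustration, not how any particular OLAP engine implements them: the records, the dimension names, and the function names (`group_by`, `slice_and_dice`, `pivot`, `rollup`) are all assumptions made for the example, and sum is used as the only aggregation function.

```python
from collections import defaultdict

DIMS = ("brand", "country", "state", "year")
GEO_HIERARCHY = ("country", "state")  # parent-to-child order

# Hypothetical fact records: dimension values followed by one measure (units sold).
facts = [
    ("BrandA", "USA", "NY", 2014, 120),
    ("BrandA", "USA", "CA", 2014, 90),
    ("BrandB", "USA", "NY", 2014, 70),
    ("BrandB", "Canada", "ON", 2015, 85),
    ("BrandA", "USA", "NY", 2015, 130),
]


def group_by(rows, dims):
    """Collate the measure by the unique values of the chosen dimensions (sum)."""
    idx = [DIMS.index(d) for d in dims]
    out = defaultdict(int)
    for row in rows:
        out[tuple(row[i] for i in idx)] += row[-1]
    return dict(out)


def slice_and_dice(rows, fixed, keep):
    """Select records matching the `fixed` values, then project onto `keep`."""
    selected = [
        row for row in rows
        if all(row[DIMS.index(d)] == v for d, v in fixed.items())
    ]
    return group_by(selected, keep)


def pivot(grouped):
    """Re-orient a two-dimensional group-by result by swapping its two axes."""
    return {(b, a): v for (a, b), v in grouped.items()}


def rollup(rows, hierarchy, level):
    """Aggregate up a hierarchy by keeping only its first `level` attributes.

    Calling it again with a larger `level` corresponds to a drill-down.
    """
    return group_by(rows, list(hierarchy[:level]))
```

For instance, `rollup(facts, GEO_HIERARCHY, 1)` summarizes units per country, while `rollup(facts, GEO_HIERARCHY, 2)` drills down to the state level.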


OLAP Server Implementations: In practice, the multi-dimensional OLAP model is usually implemented
using one of three approaches: Relational OLAP (ROLAP), Multi-dimensional
OLAP (MOLAP), or Hybrid OLAP (HOLAP) (Chaudhuri and Dayal [1997]). In the ROLAP
approach, a relational database system is used for storing and processing data in the multi-dimensional
OLAP model. The data warehouse is implemented as relations stored in tables and
queried using SQL-based OLAP queries.
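As a rough sketch of the ROLAP idea, the snippet below stores a tiny warehouse in relational tables and answers a rollup query with plain SQL. It uses Python's built-in sqlite3 module purely as a stand-in for a relational engine; the fact/dimension table layout, the table and column names, and the data are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical dimension table (geography) and fact table (sales).
    CREATE TABLE dim_geo (geo_id INTEGER PRIMARY KEY, country TEXT, state TEXT);
    CREATE TABLE fact_sales (geo_id INTEGER, year INTEGER, units INTEGER);
    INSERT INTO dim_geo VALUES (1, 'USA', 'NY'), (2, 'USA', 'CA'), (3, 'Canada', 'ON');
    INSERT INTO fact_sales VALUES (1, 2014, 120), (2, 2014, 90), (3, 2014, 30), (1, 2015, 130);
""")

# Roll the units measure up from state to country for each year.
rows = conn.execute("""
    SELECT g.country, f.year, SUM(f.units) AS units
    FROM fact_sales AS f JOIN dim_geo AS g ON f.geo_id = g.geo_id
    GROUP BY g.country, f.year
    ORDER BY g.country, f.year
""").fetchall()
```

The join between the fact and dimension tables followed by GROUP BY is the relational counterpart of aggregating along a dimension hierarchy.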


The MOLAP approach stores and processes the multi-dimensional OLAP cubes as multi- 
dimensional arrays. In most cases, the MOLAP cubes are sparse multi-dimensional arrays that are 
stored using specialized data structures to optimize data access costs. Data stored in the MOLAP 
fashion is queried using languages that can express data access using the multi-dimensional array 
model, e.g., Microsoft’s Multidimensional Expressions (MDX) language (Microsoft Developer 
Network). The hybrid OLAP strategy uses a combination of relational and multi-dimensional
OLAP implementations to store and process OLAP data. There are two ways for partitioning 
data between ROLAP and MOLAP stores: the first strategy stores the materialized view for a 
query workload in the MOLAP format and maintains the raw, detailed data in the ROLAP for- 
mat, while the second stores some section of the data (e.g., most recent or most commonly used) 
in the MOLAP format, while maintaining the remaining data in the ROLAP format. Examples 
of systems that use the HOLAP approach include OLAP servers from Microsoft, Oracle, and 
SAP (Business Application Research Center). 


Further Reading 
Basic Idea: Chaudhuri and Dayal [1997], Codd et al. [1993a,b], Gray et al. [1997]; Algorithms:
Chaudhuri and Dayal [1997], Gray et al. [1997], Harinarayan et al. [1996], Willhalm
et al. [2009a]; Packages: Microsoft SQL (Microsoft Corp., Microsoft Developer Network),
IBM DB2 and Cognos TM1 (Chamberlin [1998], IBM Corp. [a]), Oracle (Oracle
Corp. [a]), Teradata (Teradata Inc.), HP Vertica (Vertica Systems Inc. [2010]), IBM Netezza
(IBM Netezza), SAP (Intel Corp. [2011], SAP Inc. [2010]), and Palo (Jedox AG); OLAP
Applications: Business Application Research Center, Chaudhuri and Dayal [1997].


2.15 GRAPH ANALYTICS 


Graphs and related data structures (e.g., trees and directed-acyclic graphs (DAGs)) form the fun- 
damental tools used for expressing and analyzing relationships between entities. Relationships 
modeled by graphs include associations, hierarchies, sequences, positions, and paths (Barabasi 
[2003], Chakrabarti and Faloutsos [2006], Newman [2010]). Graphs have been used in a wide 
array of diverse application domains: from biology, chemistry, pharmacology, linguistics, eco- 
nomics, and operations research to different problems in computer science. 


Basic Idea: Formally, a graph G = (V, E) is described using a set of vertices (or nodes) V and the 
edges E that connect them. The graph vertices or edges can be weighted, and the edges can be 
directed or undirected. Graph vertices can also support additional attributes, e.g., a color. Usually, 
the graph is traversed in a pattern which is determined by either the graph characteristics or 
external constraints. This basic formulation of the graphs can be used for building complex data 
models. For example, molecular structure of chemical compounds is usually represented using 
graphs whose nodes represent atoms and edges represent bonds; a node- and edge-weighted graph 
can represent a transportation system between cities of a region, where the node weight represents 
city population and edge weight represents transportation density. Other applications of graphs 
include biological modeling, sociology (e.g., social network analysis), linguistics (e.g., expressing 
language syntax and semantics), electrical circuit design, combinatorial optimizations (e.g., flow 
problems), and neuroscience (e.g., modeling the brain’s cognitive connections (van den Heuvel et al.
[2008])).

Graph analytics refers to a class of techniques that either use graph models to solve a problem
(e.g., the traveling salesman and other optimization problems), or analyze and exploit the
inherent graph structure of a problem (e.g., identifying sub-graphs with a well-defined structure,
called motifs (Milo et al. [2002]), in graphs representing chemical compounds). Broadly,
graph algorithms can be classified into three overlapping categories: structural algorithms that
analyze and exploit different topological properties of a graph, traversal algorithms that navigate
different paths in a graph, and pattern-matching algorithms that find instances of different graph
patterns (e.g., cycles) in a graph. These algorithms are characterized by how the graph data is
interpreted and analyzed, and how the graphs are represented. Common graph representations
include directed and undirected graphs, graphs with weights on the edges and vertices, rooted
graphs (commonly called trees), and their variants. In practice, graphs are represented using
two different formulations: the first enumerates the graph edges using lists, while the second
uses matrices to capture the graph structure. The list-based data structures include the adjacency list,
which colocates a vertex with its neighbors (i.e., vertices with direct connections), and the incidence
list, which lists all edges as pairs (or as ordered tuples for directed graphs). The list-based representation
is popular in algorithms that navigate the graph structure. The matrix formulation provides a
concise representation of different structural attributes of a graph, e.g., connectivity, weights, and
direction. In most situations, the graph matrices are sparse and are implemented using compact
array-based data structures.
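The list-based and matrix-based formulations can be contrasted on a small example. The sketch below assumes an undirected, unweighted graph given as an edge list; the function names and the four-vertex graph are hypothetical.

```python
def to_adjacency_list(num_vertices, edges):
    """Adjacency list: each vertex is stored alongside its direct neighbors."""
    adj = {v: [] for v in range(num_vertices)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)  # undirected: record the edge in both directions
    return adj


def to_adjacency_matrix(num_vertices, edges):
    """Dense 0/1 matrix; entry [u][v] is 1 when u and v are connected."""
    matrix = [[0] * num_vertices for _ in range(num_vertices)]
    for u, v in edges:
        matrix[u][v] = matrix[v][u] = 1
    return matrix


# Edge list for a hypothetical 4-vertex undirected graph.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
adj = to_adjacency_list(4, edges)
mat = to_adjacency_matrix(4, edges)
```

The adjacency list grows with the number of edges, whereas the dense matrix always needs one entry per vertex pair, which is why sparse storage is preferred for large graphs.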


Structural Algorithms: Graph structural algorithms, commonly known as network analysis algo- 
rithms, analyze symmetric and asymmetric relationships between networked entities by exploring 
structure of the underlying graph. Usually, the networked data is represented via digraphs or net- 
works (Barabasi [2003], Chakrabarti and Faloutsos [2006], Newman [2010]). These networks 
can be structurally classified into two categories: small-world networks in which distance be- 
tween any two randomly chosen vertices grows proportional to the logarithm of the total number 
of vertices in the network (Watts and Strogatz [1998]), and scale-free networks, where the degree 
distribution follows the power law (Barabasi and Bonabeau [2003]). Graph structural algorithms 
are designed to understand and exploit inherent abstract structural properties of a network. Such
structural information can be used for different purposes. For example, telecom companies can use
the structural information of call graphs to identify customers most likely to switch carriers
(also called churn analysis; Dasgupta et al. [2008], Nanavati et al. [2006], Richter et al. [2010]).
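Two of the structural properties mentioned above, the degree distribution and the typical distance between vertices, can be measured directly from an adjacency list. The following is a small sketch under the assumption of an unweighted, undirected graph; the function names and the star-shaped example graph are hypothetical, and breadth-first search is used here as one straightforward way to obtain hop counts.

```python
from collections import Counter, deque


def degree_distribution(adj):
    """Map each degree k to the number of vertices that have exactly k neighbors."""
    return dict(Counter(len(neighbors) for neighbors in adj.values()))


def average_path_length(adj):
    """Mean hop count over all ordered pairs (source, target) with target reachable."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Hypothetical undirected graph: a hub (vertex 0) connected to four leaves.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
```

Tracking how the average path length grows with the number of vertices, or how the degree counts fall off with k, is one way to check empirically whether a network looks small-world or scale-free.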


Traversal Algorithms: The second class of graph algorithms involves traversing the edges of a graph
to find a solution to the associated problem. Graph traversal algorithms operate on graphs that either
capture the structure of some underlying physical network (e.g., roads or pipes) or capture
the abstract model of a problem (e.g., a tree representing an XML document, or a graph representing
cities and distances in a traveling salesman problem). Unlike the structural algorithms, where
in many cases analytical solutions are computable via matrix formulation, problems addressed using
traversal algorithms are notoriously hard to solve: many of them are NP-complete,
and thus the algorithms must make extensive use of heuristics.

The traversal algorithms are used to solve: (1) route problems, which aim to optimize path
lengths under different traversal constraints; (2) flow problems, which investigate the flow of material
(e.g., oil, gas, or cars) over a network represented by the underlying directed graph; (3)
coloring problems, which label graph elements (e.g., vertices) to satisfy certain constraints; and (4)
searching problems, which find a problem solution by traversing vertices that encode the problem
states.
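As one concrete instance of a route problem, the sketch below computes single-source shortest path lengths with Dijkstra's algorithm. This particular problem is chosen for illustration because, unlike the traveling salesman problem, it admits an exact polynomial-time solution; the function name, the graph encoding, and the road network are assumptions made for the example.

```python
import heapq


def shortest_path_lengths(graph, source):
    """Dijkstra's algorithm over non-negative edge weights.

    `graph` maps each vertex to a list of (neighbor, weight) pairs.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry; a shorter route to u was already found
        for v, w in graph.get(u, []):
            candidate = d + w
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return dist


# Hypothetical directed road network; weights are distances.
roads = {
    "A": [("B", 4), ("C", 1)],
    "B": [("D", 3)],
    "C": [("B", 2), ("D", 7)],
    "D": [],
}
```

The traversal order is driven by the priority queue rather than by the graph layout, which illustrates the data-dependent access patterns typical of this class of algorithms.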


Pattern-matching Algorithms: The final class of graph algorithms focuses on finding different
patterns in an input graph. The most common graph patterns include cycles, various types of cliques
(a clique is a subset of vertices of an undirected graph in which every two vertices are connected),
sub-graphs with certain properties (e.g., isomorphic to some other sub-graph), and network
motifs (Chakrabarti and Faloutsos [2006]). Important practical applications of pattern matching
include social analytics, workforce analytics, epidemiology, financial network modeling, and
neuroscience. Graph pattern matching also forms the basic tool for clustering the nodes of a graph
based on a similarity metric (Schaeffer [2007]).

The generalized combinatorial problems of enumerating or identifying structural patterns
(e.g., finding maximal sub-graphs) in a graph are NP-complete. Hence, these problems are either solved
approximately using heuristics, or restricted to constrained versions that can be solved in polynomial time.
While a majority of graph pattern-matching algorithms use traversal-based approaches to reach
the solution, some pattern-matching problems can be solved using a matrix representation (e.g.,
the adjacency matrix) of a graph. In particular, in certain scenarios, the clique discovery problem
can be viewed as a graph partitioning or clustering problem and solved using spectral clustering
techniques (Schaeffer [2007]), including the non-negative matrix factorization approach (Ding
et al. [2008]).
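A traversal-based approach to clique finding can be sketched with the basic Bron-Kerbosch recursion (the algorithm of Bron and Kerbosch [1973], shown here without pivoting). It enumerates every maximal clique exactly, so its running time can grow exponentially with graph size, which is consistent with the hardness noted above. The function name and the example graph are hypothetical.

```python
def maximal_cliques(adj):
    """Enumerate all maximal cliques with the basic Bron-Kerbosch recursion.

    `adj` maps each vertex to the set of its neighbors (undirected graph).
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates that extend r, x: already explored
        if not p and not x:
            cliques.append(sorted(r))  # r cannot be extended: it is maximal
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


# Hypothetical graph: a triangle {0, 1, 2} plus a pendant edge 2-3.
g = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
```

Calling `maximal_cliques(g)` reports the triangle and the pendant edge as the only two maximal cliques of this graph.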


Further Reading 

Basic Idea: Chakrabarti and Faloutsos [2006], Harary [1969], Newman [2010]; Algo- 
rithms: Structural (Alon [1998], Barabasi and Bonabeau [2003], Dhillon et al. [2005], Page 
et al. [1999], Watts and Strogatz [1998]), Traversal (Consortium [a,b], Harary [1969]), and 
Pattern-matching (Baskerville and Paczuski [2006], Bron and Kerbosch [1973], Chakrabarti 
and Faloutsos [2006], Ding et al. [2008], Grochow and Kellis [2007], Schaeffer [2007]); 
Packages: SPSS Modeler (IBM SPSS [2010a]) and R (The R Foundation); Applications: 
Barabasi [2003], Cecchi et al. [2008], Chakrabarti and Faloutsos [2006], Dasgupta et al. 
[2008], Krebs, Leskovec et al. [2014], Lohmann et al. [2010], Micheloyannis et al. [2006], 
Milo et al. [2002], Nanavati et al. [2006], Podolyan and Karypis [2008], Porter et al. [2009], 
Richter et al. [2010], Stam et al. [2006], van den Heuvel et al. [2008]. 







CHAPTER 3 


Accelerating Analytics 


3.1 CHARACTERIZING ANALYTICS EXEMPLARS 


In the previous chapters, we discussed various stages in a typical analytics execution workflow
and presented a set of analytics model exemplars. In this chapter, we focus on the computational and
runtime characteristics of these exemplars and discuss their impact on accelerating various analytics
algorithms.

Table 3.1 presents a summary of the analytics exemplars with their associated problem
types, functional goals (Table 1.3), and key algorithms. As Table 3.1 illustrates, an analytics exemplar
can have multiple functional goals. For example, regression algorithms can be used
either for predicting target classes for input data (classification) or as a statistical tool for quantitative
analysis. The Naive Bayes algorithm can be used in text analytics or for general clustering
purposes. Each exemplar can be implemented by multiple algorithms, each designed for specific
runtime and user constraints (e.g., linear, logistic, and probit regression). Further, an analytics
algorithm can be used by different exemplars for achieving different functional goals. For example,
logistic regression can be used as a statistical tool in a data analysis workload or as a classifier in
a neural network workload. Each algorithm, depending on the runtime constraints (e.g., whether
the application data can fit into main memory or not), can use a variety of algorithmic kernels
(Figure 1.2). Finally, a real analytics workload consists of one or more analytics components,
each with potentially different functional goals, runtime requirements, and data requirements. For
example, key components of the IBM Watson DeepQA system used in the Jeopardy! Challenge
include natural language processing of input queries, regression for ranking candidate answers,
and simulation for modeling different wagering scenarios (Ferrucci et al. [2010]).

Table 3.2 presents a summary of computational patterns, key data types, data structures, 
and functions used by algorithms for each exemplar, while Table 3.3 summarizes the runtime 
characteristics of these exemplars. 


3.1.1 COMPUTATIONAL PATTERNS 


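As Table 3.2 illustrates, while different exemplars demonstrate distinct computational characteristics,
they also exhibit key similarities. Most analytics exemplars operate on data that can be
inherently structured or unstructured (a notable exception being Monte Carlo methods, which generate
new data based on a few input parameters). Algorithmically, many exemplars operate on sparse
and high-dimensional data. Such data needs to be transformed to make the task computationally
feasible, e.g., via dimensionality reduction using principal component analysis or singular
value decomposition (Berry et al. [1994]). Further, many exemplars are formulated as optimization
problems that terminate when the target cost function is optimized.


Table 3.1: Analytics exemplar models, along with problem types and key application domains

Analytics Exemplar            | Functional Goals      | Key Algorithms
(Problem type)                |                       |
------------------------------+-----------------------+--------------------------------------------
Regression analysis           | Prediction,           | Linear, Nonlinear, Logistic Regression,
(Inferential statistics)      | Quantitative Analysis | Probit Regression
------------------------------+-----------------------+--------------------------------------------
Clustering                    | Recommendation,       | K-Means and Hierarchical clustering,
(Supervised learning)         | Prediction, Reporting | EM Clustering, Naive Bayes
------------------------------+-----------------------+--------------------------------------------
Nearest-neighbor search       | Prediction,           | K-d, Ball, and Metric trees,
(Unsupervised learning)       | Recommendation        | Locality-sensitive Hashing,
                              |                       | Approx. Nearest-neighbor
------------------------------+-----------------------+--------------------------------------------
Association rule mining       | Recommendation        | Apriori, Partition, FP-Growth,
(Unsupervised learning)       |                       | Eclat and MaxClique, Decision trees
------------------------------+-----------------------+--------------------------------------------
Recommender systems           | Recommendation        | Pearson Correlation,
(Unsupervised learning)       |                       | Latent factor/Non-negative factorization,
                              |                       | Nearest-neighbor, Naive-Bayes Classifier
------------------------------+-----------------------+--------------------------------------------
Neural networks               | Prediction,           | Single- and Multi-level perceptrons,
(Supervised learning)         | Pattern matching      | RBF, Recurrent, and Kohonen networks
------------------------------+-----------------------+--------------------------------------------
Support Vector Machines       | Prediction,           | SVMs with Linear, Polynomial, RBF,
(Supervised learning)         | Pattern matching      | Sigmoid, and String kernels
------------------------------+-----------------------+--------------------------------------------
Decision tree learning        | Prediction,           | ID3/C4.5, CART, CHAID, QUEST
(Supervised learning)         | Recommendation        |
------------------------------+-----------------------+--------------------------------------------
Time series processing        | Pattern matching,     | Trend, Seasonality, Spectral analysis,
(Data analysis)               | Alerting              | ARIMA, Exponential smoothing
------------------------------+-----------------------+--------------------------------------------
Text analytics                | Pattern matching,     | Naive Bayes classifier, String-kernel SVMs,
(Data analysis)               | Reporting,            | Latent semantic analysis,
                              | Prediction            | Non-negative matrix factorization
------------------------------+-----------------------+--------------------------------------------
Monte Carlo methods           | Simulation,           | Markov-chain, Quasi-Monte Carlo methods
(Modeling and simulation)     | Quantitative analysis |
------------------------------+-----------------------+--------------------------------------------
Mathematical programming      | Prescription,         | Primal-dual interior point,
(Optimization)                | Quantitative analysis | Branch and Bound Methods,
                              |                       | A* algorithm, Quadratic Programming
------------------------------+-----------------------+--------------------------------------------
On-line analytical processing | Reporting,            | Group-By, Slice_and_Dice, Pivoting,
(Structured data analysis)    | Prediction            | Rollup and Drill-down, Cube
------------------------------+-----------------------+--------------------------------------------
Graph analytics               | Pattern matching,     | Eigenvector Centrality, Routing,
(Unstructured data analysis)  | Recommendation        | Searching and flow algorithms,
                              |                       | Clique and motif finding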

Table 3.2: Computational characteristics of the analytics exemplars

Analytics          | Computational Patterns                   | Data types, Data structures,
Exemplar           |                                          | and Functions
-------------------+------------------------------------------+------------------------------------
Regression         | Matrix inversion, LU decomposition,      | Double-precision and Complex data,
Analysis           | Transpose, Factorization                 | Sparse/Dense matrices, Vectors,
                   |                                          | Dot Product Calculations
-------------------+------------------------------------------+------------------------------------
Clustering         | Cost-based iterative convergence         | Height-balanced tree, Graph,
                   |                                          | Distance functions, log function
-------------------+------------------------------------------+------------------------------------
Nearest-Neighbor   | Distance calculations, Hashing,          | Higher-dimensional data structures,
Search             | Singular value decomposition             | Hash tables, Distance functions,
                   |                                          | Dot Product Calculations
-------------------+------------------------------------------+------------------------------------
Association        | Set intersections, Unions, and Counting  | Hash-tree, Prefix trees, Bit vectors
Rule Mining        |                                          |
-------------------+------------------------------------------+------------------------------------
Recommender        | Vector-space Similarity,                 | Vectors, Sparse/Dense matrices,
Systems            | Latent and non-negative factorization,   | Single-/Double-precision data,
                   | Naive-Bayes classifier, Nearest neighbor | Dot Product Calculations
-------------------+------------------------------------------+------------------------------------
Neural Networks    | Matrix-Matrix/-Vector Multiplications,   | Sparse/dense matrices, Vectors,
                   | Gradient Descent Algorithms,             | Single-/Double-precision,
                   | FFTs, Convolution, Cross-correlation     | Complex data, Dot Product
-------------------+------------------------------------------+------------------------------------
Support Vector     | Linear Solvers                           | Double-precision Sparse matrices,
Machines           |                                          | Vectors, Kernel functions,
                   |                                          | Dot Product Calculations
-------------------+------------------------------------------+------------------------------------
Decision Trees     | Dynamic programming,                     | Integers, Double-precision, Trees,
                   | Recursive Tree Operations                | Vectors, Log function
-------------------+------------------------------------------+------------------------------------
Time Series        | Smoothing via averaging, Correlation,    | Integers, Single-/Double-precision,
Processing         | Fourier and Wavelet transforms           | Dense matrices and Vectors,
                   |                                          | Distance and Smoothing functions
-------------------+------------------------------------------+------------------------------------
Text Analytics     | Parsing, Bayesian modeling,              | Integers, Single-/Double-precision,
                   | Matrix Factorization, Multiplication,    | Sparse matrices, Vectors,
                   | Hashing, String Matching,                | Strings, Distance functions,
                   | Set Operations (Union and Intersection)  | String functions, Inverse Indexes,
                   |                                          | Dot Product Calculations
-------------------+------------------------------------------+------------------------------------
Monte Carlo        | Random number generators,                | Double-precision, Bit vectors,
Methods            | Polynomial evaluation, Interpolation     | Bit-level operations
-------------------+------------------------------------------+------------------------------------
Mathematical       | Linear Solvers, Factorization,           | Integers, Double-precision,
Programming        | Dynamic programming,                     | Vectors, Trees,
                   | Greedy Algorithms                        | Sparse matrices, Adjacency list
-------------------+------------------------------------------+------------------------------------
On-line Analytical | Grouping and ordering,                   | Prefix trees, Relational tables,
Processing         | Aggregation over hierarchies             | Sorting, Ordering, OLAP Operators
-------------------+------------------------------------------+------------------------------------
Graph Analytics    | Graph traversal, Eigensolvers,           | Integer, Single-/Double-precision,
                   | Matrix-Matrix/-Vector multiplication,    | Sparse matrices, Trees,
                   | Non-negative matrix factorization        | Adjacency Lists


These factors impact the data structure design, computational patterns, and functions of
the exemplars. Broadly, the analytic exemplars can be classified into two classes: the first class exploits
mathematical (e.g., linear algebraic) formulations and the second operates on non-numeric
data structures. Exemplars belonging to the first class, e.g., mathematical programming, Monte
Carlo methods, regression analysis, recommender systems, support vector machines, and neural
networks, use linear-algebraic formulations to capture relationships in the underlying data. The
second class, which includes clustering, nearest-neighbor search, association rule mining, decision
tree learning, and OLAP, uses non-matrix data structures. Exemplars like text analytics and
graph analytics can use either approach. Linear algebraic frameworks usually use two- or three-dimensional
matrices to encode relationships in low-dimensional data and vectors to represent
locations in a high-dimensional data space. Depending on the type of data relationship, matrices
can be either sparse or dense and are used in various linear algebraic kernels such as matrix-matrix and
matrix-vector multiplication, inversion, transpose, linear solvers, and various factorizations such
as Cholesky factorization, singular value decomposition, and non-negative matrix factorization
(Berry et al. [1994], Kleinberg and Tomkins [1999]). When the data has complex relationships
(e.g., OLAP data organized in multiple shared hierarchies) or requires operations that cannot be
encoded as matrix or vector operations, analytics exemplars use data structures like hash tables,
queues, sets, graphs, inverse indexes, adjacency lists, and trees. Common operations on these data
structures include traversals, hash table queries, set unions and intersections, sorting, and grouping.
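To make the sparse-matrix kernels mentioned above more concrete, the sketch below stores a matrix in compressed sparse row (CSR) form and multiplies it by a vector. CSR is used here only as one common compact layout (the text does not prescribe a specific format), and the function names and the example matrix are assumptions made for illustration.

```python
def to_csr(dense):
    """Convert a dense row-major matrix (list of lists) to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)   # non-zero value
                col_idx.append(j)  # its column index
        row_ptr.append(len(values))  # where the next row starts in `values`
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x, touching only the stored non-zero entries."""
    y = [0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]  # indirect access into x
    return y


dense = [
    [5, 0, 0, 0],
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 2],
]
values, col_idx, row_ptr = to_csr(dense)
```

The lookup `x[col_idx[k]]` goes through an index array, so consecutive iterations may touch non-adjacent elements of `x`; this is the kind of irregular memory access that sparse kernels tend to exhibit.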

The analytic exemplars use a variety of types, such as integers, strings, bit vectors, and floating-point
variables (e.g., single precision, double precision, and complex), to represent input data and output
results. Most exemplars require high-precision calculations (certain mathematical programming
algorithms require both high precision and result repeatability). One notable exception is neural
networks, where certain algorithms can be implemented using low-precision types (Gupta et al.
[2015]).

Finally, these exemplars use a wide array of functions to compare, transform, and modify
input data. Examples of common analytic functions include various distance functions (e.g., Euclidean),
kernel functions (e.g., linear, sigmoid), statistical and aggregation functions (e.g., sum,
min, or average), data organization functions (e.g., sorting, hashing, and grouping), and smoothing
functions (e.g., correlation). These functions, in turn, make use of intrinsic library functions such
as log, sine, or sqrt, or various bit manipulation routines.
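A few of these functions are simple enough to write out directly. The sketch below shows a Euclidean distance and two kernel functions built on a dot product; the function names and the default values of `gamma` and `coef0` are illustrative assumptions rather than settings taken from any specific library.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def linear_kernel(a, b):
    """Linear kernel: the plain dot product."""
    return dot(a, b)


def sigmoid_kernel(a, b, gamma=1.0, coef0=0.0):
    """Sigmoid kernel tanh(gamma * <a, b> + coef0); defaults are illustrative."""
    return math.tanh(gamma * dot(a, b) + coef0)
```

Both kernels reduce to a dot product followed by a cheap scalar function, and the distance reduces to a sum of squares followed by `sqrt`, which is why dot products and intrinsic math routines recur across so many rows of Table 3.2.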


3.1.2 RUNTIME CHARACTERISTICS 


Table 3.3 summarizes the runtime characteristics of the analytics exemplars. The key distinguishing
feature of analytics applications is that they usually process input data in a read-mostly
manner. The input data can be scalar, structured (e.g., images), or unstructured (e.g., raw text),
and is usually read from files, streams, or relational tables in binary or text format. In most
cases, the input data is large, which requires analytics applications to store and process data from
disks. In the case of time-series processing, the large-volume data is usually streamed and can be either
structured (e.g., web-service messages) or unstructured (e.g., text messages). Notable exceptions
to this pattern are Monte Carlo methods and mathematical programming, which are inherently
in-memory as they operate on small input data. In most cases, the results of the analysis are
smaller than the input data. Only three exemplars (association rule mining, Monte Carlo
methods, and on-line analytical processing) generate larger output. Most analytics exemplars,
with the exception of time-series processing, operate in batch mode; time-series processing
has real-time constraints. Finally, analytics algorithms can involve one or more stages (e.g., in
supervised learning), where each stage can invoke the underlying algorithm in an iterative or
non-iterative manner. For iterative workloads, for the same input data size, the running time
can vary depending on the precision required in the results.
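The dependence of running time on the required precision can be seen in a toy iterative solver. The sketch below runs plain gradient descent on an invented one-dimensional cost function and reports how many iterations were needed for a given tolerance; the function names, step size, and cost function are assumptions chosen only to illustrate the point.

```python
def minimize(grad, x0, step=0.1, tol=1e-6, max_iter=10_000):
    """Plain gradient descent; stops once an update moves x by less than `tol`."""
    x = x0
    for iteration in range(1, max_iter + 1):
        delta = step * grad(x)
        x -= delta
        if abs(delta) < tol:
            return x, iteration
    return x, max_iter


def grad(x):
    """Gradient of the hypothetical cost f(x) = (x - 3)**2."""
    return 2.0 * (x - 3.0)


# Same input, two different precision requirements.
x_coarse, iters_coarse = minimize(grad, x0=0.0, tol=1e-2)
x_fine, iters_fine = minimize(grad, x0=0.0, tol=1e-8)
```

The input is identical in both runs, yet the tighter tolerance needs several times as many iterations before the stopping test is met.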


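Table 3.3: Runtime characteristics of the analytics exemplars

                   | Execution characteristics     | Input-Output characteristics
Analytics Exemplar | Methodology   | Memory Issues | Input Data            | Output Data
-------------------+---------------+---------------+-----------------------+--------------
Regression         | Iterative     | In-memory     | Large historical      | Small
Analysis           |               | Disk-based    | Structured            | Scalar
-------------------+---------------+---------------+-----------------------+--------------
Clustering         | Iterative     | In-memory     | Large historical      | Small scalar
                   |               | Disk-based    | Unstructured          | Unstructured
                   |               |               | Structured            | Structured
-------------------+---------------+---------------+-----------------------+--------------
Nearest-Neighbor   | Non-iterative | In-memory     | Large historical      | Small
Search             |               |               | Structured            | Scalar
                   |               |               |                       | Structured
-------------------+---------------+---------------+-----------------------+--------------
Association Rule   | Iterative     | In-memory     | Large historical      | Larger
Mining             | Non-iterative | Disk-based    | Structured            | Structured
-------------------+---------------+---------------+-----------------------+--------------
Recommender        | Non-iterative | Disk-based    | Large historical      | Small
Systems            |               |               | Structured            | Structured
-------------------+---------------+---------------+-----------------------+--------------
Neural Networks    | Iterative     | In-memory     | Large                 | Small
                   | Two Stages    | Disk-based    | Structured            | Scalar
-------------------+---------------+---------------+-----------------------+--------------
Support Vector     | Iterative     | In-memory     | Large                 | Small
Machines           | Two Stages    | Disk-based    | Structured            | Scalar
-------------------+---------------+---------------+-----------------------+--------------
Decision Tree      | Iterative     | In-memory     | Large                 | Small
Learning           | Two Stages    | Disk-based    | Unstructured          | Scalar
-------------------+---------------+---------------+-----------------------+--------------
Time Series        | Non-iterative | In-memory     | High volume streaming | Smaller
Processing         | Real-time     |               | Unstructured          | Scalar
                   |               |               | Structured            | Streaming
-------------------+---------------+---------------+-----------------------+--------------
Text Analytics     | Iterative     | In-memory     | Large historical      | Large/small
                   | Non-iterative | Disk-based    | Unstructured          | Unstructured
                   |               |               | Structured            | Structured
-------------------+---------------+---------------+-----------------------+--------------
Monte Carlo        | Iterative     | In-memory     | Small                 | Large
Methods            |               |               | Scalar                | Scalar
-------------------+---------------+---------------+-----------------------+--------------
Mathematical       | Iterative     | In-memory     | Small                 | Small
Programming        |               |               | Scalar                | Scalar
-------------------+---------------+---------------+-----------------------+--------------
On-line Analytical | Non-iterative | In-memory     | Large historical      | Larger
Processing (OLAP)  |               | Disk-based    | Structured            | Structured
-------------------+---------------+---------------+-----------------------+--------------
Graph Analytics    | Iterative     | In-memory     | Large historical      | Small
                   |               | Disk-based    | Unstructured          | Unstructured


3.2 IMPLICATIONS ON ACCELERATION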


Given the varied computational and runtime characteristics of the analytics exemplars, it is clear 
that a single solution for accelerating different analytics applications would be sub-optimal. As 
Tables 3.2 and 3.3 demonstrate, each exemplar has a unique set of computational and runtime 
features, and ideally, every exemplar would get a system tailor-made to match its requirements. 
However, we have also observed that different analytic exemplars share many computational and 
runtime features. Therefore, for a systems designer, the challenge is to customize analytics sys- 
tems using as many re-usable software and hardware components as possible. In this section, 
we describe various opportunities for accelerating analytics workloads on existing software and 
hardware systems, and then discuss how to build re-usable accelerated components. 


3.2.1 SYSTEM ACCELERATION OPPORTUNITIES 


Based on the computational and runtime characteristics described in Tables 3.2 and 3.3, we first 
classify the analytics exemplars based on their performance bottlenecks: 


• Compute-bound Exemplars: Mathematical Programming and Monte Carlo Methods

• Compute-bound or Network-/Memory-bound Exemplar: Time-series Processing

• Compute-bound (when in-memory) and I/O-bound (when disk-based): Text Analytics,
  Regression Analysis, Clustering, Nearest-neighbor Search, Neural Networks, Support Vector
  Machines, Recommender Systems

• Memory-bound (when in-memory) and I/O-bound (when disk-based): OLAP, Graph
  Analytics, Text Analytics, Decision Tree Learning, Association Rule Mining


Among all the exemplars, only two—mathematical programming and Monte Carlo 
methods—are purely compute-bound. A majority of the remaining exemplars are compute-bound 
when the data is entirely in-memory or affected by the cost of accessing data from network, mem- 
ory or disks (time-series algorithms usually operate on streaming data and are bound by network 
latencies). For these exemplars, the amount of computation usually increases (at a rate determined
by the algorithmic complexity) as the amount of data increases. Thus, in such cases, bigger data
translates into bigger compute as well. The remaining exemplars are memory-bound when the
data is in-memory and I/O-bound when the data is on disk. Thus, accelerating analytics workloads
requires a holistic approach that addresses these interconnected bottlenecks: memory, interconnect,
compute, and storage. Traditional acceleration approaches take one or more of these paths:
(1) modifying algorithms to exploit the available hardware resources; (2) improving the performance
of existing functions using better hardware; and (3) employing new algorithms that can exploit novel
architectures.

Approaches in the first path involve techniques to parallelize the existing algorithms to
enable them to exploit multiple computing resources (e.g., multiple cores on a processor or multiple
CPU nodes). The type of parallelization approach depends on the computation and runtime
properties of the exemplars. A majority of analytics exemplars use shared state to represent either the
program state or a cost function to be optimized. Among the data structures used for implementing
the shared state, only matrices are amenable to scalable distributed-memory parallelization.
A majority of analytics exemplars operate on sparse data, materialized in memory using either
sparse matrices or specialized data structures such as prefix trees (Lakshmanan et al. [2003]).
Operations on sparse data structures access data via indirection arrays, which can generate
non-contiguous memory accesses. A number of algorithms that use sparse data structures, in
particular those that operate on graphs, generate gather-scatter memory access patterns. These issues
limit the use of the distributed-memory parallelization approach to the exemplars that use matrices
to represent shared state (e.g., Regression, Clustering, OLAP, etc.). Among all exemplars, only
the Monte Carlo methods are inherently embarrassingly parallel, while some approaches need
special reformulation to eliminate execution dependencies, e.g., using the Alternating Direction
Method of Multipliers (ADMM) to parallelize optimization problems (Boyd et al. [2011]). In
scenarios where the exemplars operate on out-of-core data sets, techniques such as MapReduce
(Bekkerman et al. [2011], Leskovec et al. [2014]) can be exploited. However, the MapReduce
approach is best suited for non-iterative, embarrassingly parallel algorithms that run in
batch mode.
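
To make the point about indirection arrays concrete, the following is a minimal sketch of a sparse matrix-vector product over the compressed sparse row (CSR) layout. The function name `csr_matvec` and the tiny example matrix are illustrative assumptions, not taken from any particular library; the sketch only shows why reads of the dense vector become data-dependent gathers.

```python
def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply a CSR-format sparse matrix by a dense vector x."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        # The non-zeros of this row are values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # x is read through the indirection array col_idx: a "gather"
            # whose addresses depend on the data and are rarely contiguous.
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Illustrative 3x4 matrix:
# [[5, 0, 0, 1],
#  [0, 0, 2, 0],
#  [0, 3, 0, 4]]
values = [5.0, 1.0, 2.0, 3.0, 4.0]
col_idx = [0, 3, 2, 1, 3]
row_ptr = [0, 2, 3, 5]
x = [1.0, 2.0, 3.0, 4.0]
y = csr_matvec(values, col_idx, row_ptr, x)  # [9.0, 6.0, 22.0]
```

Each row can be processed independently, which suits shared-memory parallelism, but the irregular reads of `x` are what make partitioning such data across distributed memory harder than for dense matrices.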

The second approach involves using better hardware to accelerate key performance com- 
ponents of the exemplars. For example, one could use solid-state drives (SSDs) for accelerating 
read-only I/O accesses. The SSDs would also improve performance of non-contiguous disk ac- 
cesses that may be generated during computations involving sparse data. The exemplars that are 
memory-bound would benefit from improved memory hierarchies: deeper cache hierarchies and 
larger main memory systems. Finally, time-series processing would benefit from faster networking
systems such as InfiniBand that allow remote direct memory access (RDMA). 

Finally, techniques in the third approach involve using algorithmic techniques that exploit 
dedicated hardware accelerators, such as single-instruction multiple-data (SIMD) instructions 
(e.g., x86 AVX or Power VSX), GPUs, FPGAs, and ASICs. Unlike the first two approaches, 
this approach is more suited toward accelerating key computational kernels and functions. As il- 
lustrated in Table 3.2, the analytics exemplars exhibit several repeated computational kernels and 
functions that can be accelerated using hardware accelerators. Examples of such kernels include
various matrix operations such as matrix-matrix and matrix-vector multiplications, linear solvers,
and factorizations. These can be readily accelerated using optimized libraries such as Intel MKL,
ESSL, or cuBLAS, which use the data-parallel features of SIMD units or GPUs. Other kernels that
can be accelerated using SIMD or GPUs include FFT, convolution, hashing, sorting, and random number 
generators. GPUs and SIMD capabilities can be also used for accelerating various functions used 
by the exemplars, e.g., various distance functions, set operations, aggregation and statistical func- 
tions (e.g., MIN, MAX, Average), and bit-vector operations. Several of these functions can be 
also accelerated by specialized implementations on FPGAs or ASICs. In particular, FPGAs are
ideally suited for accelerating computational patterns that involve extensive branching (e.g., those
based on finite state machine automata), bit-vector manipulations, or dataflow execution. For
example, FPGAs can be used to accelerate string pattern matching functions (useful in text analytics
and time series processing), bit manipulation, and tree traversal functions.

Table 3.4: Opportunities for parallelizing and accelerating analytics exemplars.

Model Exemplar | Bottleneck | Acceleration Requirements and Opportunities
Regression Analysis; Clustering; Nearest-Neighbor Search; Recommender Systems; Neural Networks; Support Vector Machines | Compute-bound, I/O-bound | Shared- and distributed-memory task parallelism; Data parallelism via SIMD or GPUs; Faster I/O using solid state drives
Association Rule Mining | Memory-bound, I/O-bound | Shared-memory task parallelism; Faster I/O using solid state drives; Larger, deeper, and faster memory hierarchies; Faster bit operations or tree traversals via FPGAs
Decision Tree Learning | Memory-bound, I/O-bound | Larger, deeper, and faster memory hierarchies
Time Series Processing | Compute-bound, Memory-bound | Shared-memory task parallelism; Data parallelism via SIMD or GPUs; High-bandwidth, low-latency memory and networking; Pattern matching via FPGA
Text Analytics | Compute-bound, Memory-bound, I/O-bound | Shared- and distributed-memory task parallelism; Data parallelism via SIMD or GPUs; Larger, deeper, and faster memory hierarchies; Faster I/O via solid state drives; Pattern matching and string processing via FPGA
Monte Carlo Methods | Compute-bound | Shared- and distributed-memory task parallelism; Data parallelism via SIMD or GPUs; Faster bit manipulations using FPGAs or ASICs
Mathematical Programming | Compute-bound | Shared-memory task parallelism; Data parallelism via SIMD or GPUs; Larger and deeper memory hierarchies; Search-tree traversals via FPGAs
On-line Analytical Processing | Memory-bound, I/O-bound | Shared- and distributed-memory task parallelism; Data parallelism via SIMD or GPUs; Larger and deeper memory hierarchies; Pattern matching via FPGAs; Faster I/O using solid state drives
Graph Analytics | Memory-bound, I/O-bound | Shared-memory task parallelism; Larger and deeper memory hierarchies; Massive data-parallelism via GPUs
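
As a rough illustration of why set and bit-vector operations map well onto SIMD units, GPUs, or FPGAs, the sketch below encodes item sets as bit vectors so that intersection and union become single bitwise operations. The helper names (`to_bitvector`, `popcount`) and the shopping-basket data are hypothetical; Python integers stand in for the fixed-width machine words real hardware would use.

```python
def to_bitvector(items, universe_index):
    """Encode a set of items as an integer with one bit per universe element."""
    bits = 0
    for item in items:
        bits |= 1 << universe_index[item]
    return bits


def popcount(bits):
    """Number of set bits, i.e., the cardinality of the encoded set."""
    return bin(bits).count("1")


universe = ["bread", "milk", "eggs", "beer", "chips"]
index = {item: i for i, item in enumerate(universe)}

t1 = to_bitvector(["bread", "milk", "eggs"], index)
t2 = to_bitvector(["milk", "eggs", "beer"], index)

common = t1 & t2                                   # set intersection
jaccard = popcount(t1 & t2) / popcount(t1 | t2)    # a set-based similarity measure
```

Because every bit position is processed independently, the same AND/OR/popcount sequence can be applied to many words at once, which is the property the accelerators in Table 3.4 exploit.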
These three approaches can be used for building re-usable software components that can
be specialized to take advantage of available hardware features. Given the functional flow of the
analytics workloads (Figure 1.2), one can build an analytics workload using libraries that employ
different analytics algorithms based on user or runtime constraints, where an algorithm can have
multiple system implementations, e.g., using shared- or distributed-memory parallelism or
using MapReduce. Each implementation can, in turn, use specialized kernels or functions that can
exploit various hardware accelerators such as SIMD, GPUs, or FPGAs. The overall execution
can be further improved by using hardware components (e.g., SSDs) suited to the individual
algorithm. Such hardware-software co-design would then enable optimized analytics solutions that
can balance customization and commoditization. 
CHAPTER 4 


Accelerating Analytics in 
Practice: Case Studies 


4.1 TEXT ANALYTICS 


Broadly, text analytics workloads can be classified into two groups based on their computational 
patterns, namely, those workloads that operate on the text data in native form, and those that 
operate on the data structures that are derived from the input data (e.g., TF/IDF matrices). These 
factors determine the type of acceleration employed in a text analytics workload. 

Key workloads that use native text processing include natural language processing (NLP), 
network intrusion detection systems, bio-informatics, and semi-structured text processing (e.g., 
XML and RDF data). One of the basic tasks of any NLP system is to parse the natural language 
input. The most common approach for natural language parsing uses syntactic parsing, which
analyzes the grammatical structure of sentences and predicts their parse trees. This approach uses the
Cocke-Kasami-Younger (CKY) dynamic programming algorithm to identify the most likely parse trees
for context-free languages. The CKY algorithm uses a two-dimensional array called the CKY table
to store all possible derivations from the context-free grammar. Computations on the CKY table
are highly parallelizable and can also be viewed as matrix multiplication operations (Thompson
[1994]). The CKY parser has been parallelized on traditional parallel systems using OpenMP
and MPI (Johnson [2011]), FPGAs (Bordim et al. [2003]), and GPUs (Yi et al. [2011]).
Currently, the highest performing CKY parser for context-free languages uses GPUs and provides 2
to 5 orders of magnitude improvement over CPU implementations (Canny et al. [2013]). 
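
The table-filling structure of CKY can be sketched as follows. This is a simplified recognizer (it only answers whether a sentence is derivable, and does not compute the most likely parse tree), and the toy grammar in Chomsky normal form, the function name `cky_recognize`, and the dictionary-based rule encoding are all assumptions made for illustration.

```python
def cky_recognize(words, lexical, binary, start="S"):
    """Return True if `words` can be derived from `start`.

    lexical: maps a word to the set of non-terminals A with a rule A -> word
    binary:  maps a pair (B, C) to the set of non-terminals A with A -> B C
    """
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds the non-terminals that derive words[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][i] = set(lexical.get(word, ()))
    for span in range(2, n + 1):
        # All cells of one span length are independent of each other,
        # which is where the parallelism in the CKY table comes from.
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point between left and right parts
                for left in table[i][k]:
                    for right in table[k + 1][j]:
                        table[i][j] |= binary.get((left, right), set())
    return start in table[0][n - 1]


# Toy grammar: S -> NP VP, VP -> V NP, NP -> "she" | "fish", V -> "eats"
lexical = {"she": {"NP"}, "fish": {"NP"}, "eats": {"V"}}
binary = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}}

accepted = cky_recognize("she eats fish".split(), lexical, binary)
rejected = cky_recognize("eats she fish".split(), lexical, binary)
```

The innermost combination over split points has the same shape as a matrix product in which multiplication is "apply a grammar rule" and addition is "set union", which is one way to read the matrix-multiplication view mentioned above.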

Another interesting application of text analytics is in network intrusion detection systems
such as Snort (Roesch [1999]). Intrusion detection systems (IDSs) identify malicious incoming
traffic by inspecting packet payloads for attack signatures. Most IDSs, including Snort, encode the
attack signatures as strings and compare input network packet headers against multiple attack
signatures to identify any malicious pattern. In practice, the string matching operation accounts for
up to 70% of Snort execution time. Broadly, string matching algorithms can be classified into two
groups based on the underlying data structures: algorithms that construct finite state machines
(FSMs) by building tree representations (e.g., the Aho-Corasick algorithm builds a prefix tree
(trie) with additional links between various internal nodes), and algorithms that view strings as
character arrays and operate on them using set operations (e.g., the Boyer-Moore or
Boyer-Moore-Horspool algorithms). FSM-based algorithms are more amenable to hardware implementations
using FPGAs, whereas the set-oriented string matching algorithms are more amenable to
acceleration via SIMD. In particular, both Intel AVX and Power VSX SIMD instruction sets have been
used to accelerate string matching problems (Ladra et al. [2012]). Over the years, there have also
been several efforts to accelerate FSM-based string and regular expression matching functions
in intrusion detection systems using FPGAs (Aldwairi et al. [2005]). A recent development in this
space is the emergence of specialized processors designed to accelerate non-deterministic finite
automata problems, e.g., the Micron Automata processor (Dlugosch et al. [2014]). This processor
has been shown to accelerate a number of finite-automata-based text analytics problems (Roy and
Aluru [2014]). In general, FSM-based algorithms are also applicable to the general problem of
regular expression matching for text analytics. Regular expression matching has several uses,
including intrusion detection and bio-informatics (Atasu et al. [2013]). Another interesting use of
FSM-based algorithms is in processing XML documents. The XML execution model views an XML
document as a rooted ordered tree. Various XML query languages such as XPath and XQuery
then navigate the XML document tree. Both parsing input XML documents and XML tree
traversals use finite-state automata. Both of these operations are amenable to FPGA acceleration,
and there are several systems in use (e.g., the DataPower XML accelerator) that exploit FPGAs
specifically to accelerate XML computations. FPGA acceleration of XML processing is particularly
attractive in the streaming domain, e.g., in the web-services scenario, where incoming XML packets
need to be parsed and processed in real time (Dai et al. [2010]). 
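
A minimal sketch of the FSM-based approach is given below: an Aho-Corasick automaton built as a trie with failure links, then run over the input one character per step. The function names (`build_automaton`, `search`) and the sample patterns are illustrative assumptions; this is not how Snort or any FPGA design encodes its signatures, only the textbook structure those designs start from.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links, and per-state outputs."""
    goto = [{}]   # goto[s] maps a character to the next state; state 0 is the root
    fail = [0]    # fail[s] is the state to fall back to on a mismatch
    out = [[]]    # out[s] lists the patterns that end at state s
    for pattern in patterns:
        s = 0
        for ch in pattern:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(pattern)
    # A breadth-first pass adds the "additional links between internal nodes".
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out


def search(text, automaton):
    """Return (start_offset, pattern) for every match, in a single pass over text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            hits.append((pos - len(pattern) + 1, pattern))
    return hits


automaton = build_automaton(["he", "she", "his", "hers"])
matches = search("ushers", automaton)
```

All signatures are matched simultaneously with one state transition per input character, which is the property that makes this formulation attractive for fixed-latency hardware pipelines.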

The second class of text analytics workloads, e.g., text classification, clustering, semantic
and topic analysis, uses matrices to summarize key information about the underlying text
document corpus and operates on these matrices to get relevant information (Berry et al. [1995]). Key
matrix-based operations for text analytics include Singular Value Decomposition, Non-negative
Matrix Factorization, and Eigenvector computations (e.g., PageRank). In most cases, the matrices
are sparse, and the performance of these operations on traditional multi-core CPUs is very poor.
In these cases, GPUs have been shown to be very effective for improving the performance of matrix
computations (Zhang et al. [2009]). 
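
As one example of an eigenvector computation over sparse data, the sketch below runs PageRank by power iteration on adjacency lists. The function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the four-node link graph are assumptions chosen for illustration; every link target is assumed to appear as a key of `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a sparse link graph given as adjacency lists."""
    nodes = sorted(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            # Nodes without outgoing links spread their rank uniformly.
            targets = links[v] or nodes
            share = damping * rank[v] / len(targets)
            for t in targets:
                # Scattered, data-dependent updates: the sparse access pattern
                # that limits performance on conventional CPUs.
                new_rank[t] += share
        rank = new_rank
    return rank


links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
top = max(ranks, key=ranks.get)
```

Each iteration is effectively one sparse matrix-vector product, so the cost of the whole computation is dominated by the same irregular memory accesses discussed in Section 3.2.1.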


Further Reading 

Algorithms: Parsing (Canny et al. [2013], Thompson [1994]), Snort (Roesch [1999]), Ma- 
trix Algorithms (Berry et al. [1995]); Accelerator Systems: SIMD exploitation (Ladra et al. 
[2012], Salapura et al. [2012], Shi et al. [2011]), FPGA (Aldwairi et al. [2005], Atasu et al. 
[2013], Bordim et al. [2003], Court and Herbordt [2007], Cronin [2014], Dai et al. [2010], 
Mitra et al. [2009a], Roy and Aluru [2014], Schlegel et al. [2013]), GPUs (Canny et al. 
[2013], Johnson [2011], Kysenko et al. [2012], Yi et al. [2011], Zhang et al. [2009]), Mi- 
cron Automata (Dlugosch et al. [2014], Roy and Aluru [2014]). 

4.2 DEEP LEARNING


Deep learning is perhaps the most exciting and practical area of computer science right now. 
Broadly, deep learning refers to an architecture which is built using multiple layers of machine 
learning components, e.g., neural networks with many hidden layers (Bengio [2009]). A deep 
learning system is designed to learn complex functions that represent high-level abstractions such 
as images, speech, or languages. In practice, a deep learning system is built using different types 
of neural networks, e.g., convolution neural networks and fully-connected perceptrons (FLPs), 
and can be very deep (e.g., GoogLeNet has 22 layers; Szegedy et al. [2014]). In this section, we 
discuss various approaches in accelerating deep learning systems. 

Most commercial installations of deep learning systems are geared toward addressing
problems in the consumer space such as intelligent image and video processing (e.g., the Clarifai
image processing system) and speech recognition (e.g., Deep Speech by Baidu, IBM Watson
Speech Recognition, or Google Now; Hinton et al. [2012], Ng [2015]). The basic goal of these
systems is to accurately classify input query objects (e.g., identify the breed of dog from a set of
dog pictures or identify phonemes from a set of audio samples). Some systems are also capable
of providing advanced features such as finding similar images or speech-to-text transcription. To
accurately classify an object, the system needs to first extract as many of its features as possible.
Therefore, most deep learning systems have a hybrid architecture, where the initial stages
perform feature extraction and the later stages perform classification. The feature extraction layers
use either convolution or recurrent neural networks. The classification layers are implemented
using fully-connected perceptrons (FLPs). Figure 4.2 represents a typical speech processing deep
learning system which has two convolution layers and four fully connected layers. The system
takes an audio speech signal as input and identifies the corresponding phoneme. First, the speech
signal is pre-processed into samples of a frequency spectrogram, which captures the key
coarse-grained features of the input signal. These samples are then processed by two layers of convolution
neural networks that capture fine-grained features of the input data. The output of the convolution
layers is then classified by a series of FLPs. The output of this system is one of the pre-determined
phoneme classes (a typical number of classes is 32 K). 

Figure 4.1: Architecture of a deep learning system.

Deep learning systems use a supervised training approach to train the neural network
models (a network model broadly refers to the matrices used to represent the state of the system).
Training a neural network usually involves minimizing a non-convex cost function. Most systems solve
this optimization problem using variants of the gradient descent algorithm, e.g., the stochastic
gradient descent algorithm. The training process involves a forward pass to compute the current
state of the system, and a backward pass that updates the weights of the system based on the
gradient descent solution. Computationally, both forward and backward passes involve matrix-matrix
and matrix-vector multiplications over large single-precision matrices (the convolution operation can
be implemented using FFT as well; Vasilache et al. [2014]). In practice, to improve the classification
accuracy, deep learning systems use large training datasets, have deeper network layers, and
have large models. Further, certain neural network models, e.g., convolution neural networks, are
far more computationally intensive than the traditional fully-connected perceptrons. For example,
in a speech processing deep learning system, around 30% of the end-to-end time is consumed
by the convolution neural network stages. Each convolution neural network layer requires 3 × 10°
floating point operations per minibatch of training data per iteration. Each iteration operates
over multiple hours of training data, where each hour corresponds to around 1500 minibatches,
and it takes around 30 iterations to converge (van den Berg et al. [2015]). Thus, the total amount
of computation required for training a deep learning pipeline is extremely large. 
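
The forward and backward passes described above can be sketched for the smallest possible case: a single linear layer trained with a squared-error loss. This is an assumption-laden toy (no non-linearity, one training sample, hypothetical names `matvec` and `sgd_step`), intended only to show that both passes reduce to matrix-vector style arithmetic.

```python
def matvec(W, x):
    """Dense matrix-vector product, the core kernel of the forward pass."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sgd_step(W, x, target, lr):
    """One gradient descent step for the loss 0.5 * ||W x - target||^2."""
    y = matvec(W, x)                                  # forward pass
    err = [yi - ti for yi, ti in zip(y, target)]      # dLoss/dy
    loss = 0.5 * sum(e * e for e in err)
    # Backward pass: dLoss/dW[i][j] = err[i] * x[j] (an outer product),
    # followed by the weight update W <- W - lr * gradient.
    W_new = [[w - lr * e * xj for w, xj in zip(row, x)]
             for row, e in zip(W, err)]
    return W_new, loss


W = [[0.0, 0.0], [0.0, 0.0]]
x, target = [1.0, 2.0], [1.0, -1.0]
losses = []
for _ in range(5):
    W, loss = sgd_step(W, x, target, lr=0.1)
    losses.append(loss)
```

In a real system the same two steps run over minibatches of inputs, turning the matrix-vector products into the large matrix-matrix multiplications that dominate training time.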

Thus, training a deep learning system is a big data problem in which larger datasets lead to
even bigger computational requirements due to the computationally expensive kernels. Clearly,
the only way of improving the performance is to parallelize the overall computation and accelerate
individual kernels. There are three common approaches for parallelizing the training process: (1)
model parallelism, in which the model (i.e., the matrices) is partitioned and the individual
sub-matrices are then trained over the entire training set; (2) data parallelism, in which the entire
model is trained concurrently using distinct pieces of the training set (called minibatches); and
(3) hybrid parallelism, which uses both model and data parallelism (Krizhevsky [2014]). A deep
learning system is usually built as a cluster of CPU or GPU nodes. Although one can build a
deep learning system as a cluster of CPUs (e.g., Project Adam (Chilimbi et al. [2014]) or using
the Blue Gene supercomputer (Chung et al. [2014])), a hybrid system with CPUs and GPUs
is more suitable. Given the heavy single-precision computational training workload, GPUs are
primarily used for accelerating the core computational kernels such as convolutions and matrix
multiplications (Chetlur et al. [2014], van den Berg et al. [2015]). Exploitation of GPUs for
deep learning is further helped by the fact that all the key deep learning programming frameworks
(e.g., Theano, Caffe, and Torch) provide support for GPU acceleration. Given the high single-precision
capabilities of GPUs, one can train the same sized deep learning system using a smaller cluster. For
example, the 16-thousand-node CPU cluster used by the Google Brain system has since been replaced
by a much smaller cluster of GPU and CPU nodes (Dean [2015]). A similar design approach is
followed by other commercial deep learning systems, e.g., the Baidu Minwa machine is built as
a cluster of 36 nodes, each with two 6-core Intel Xeon E5-2620 processors and four Nvidia K40 GPUs,
connected by a high-performance low-latency InfiniBand network (Wu et al. [2015]). 
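
Data parallelism, the second approach listed above, can be sketched with a one-parameter model: each worker computes a gradient on its own shard of the minibatch, and the gradients are combined before a single shared update. The names `gradient` and `data_parallel_step`, the 1-D model y = w * x, and the data are hypothetical; in a real cluster the averaging step would be a collective operation across nodes rather than a Python loop.

```python
def gradient(w, batch):
    """Gradient of the mean squared error of the 1-D model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_step(w, shards, lr):
    # Each worker computes a gradient on its own shard; the computations
    # are independent and could run concurrently on separate nodes.
    local = [gradient(w, shard) for shard in shards]
    # The local gradients are averaged, weighted by shard size
    # (an all-reduce in a real distributed system).
    total = sum(len(shard) for shard in shards)
    g = sum(gi * len(shard) for gi, shard in zip(local, shards)) / total
    return w - lr * g


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [data[:2], data[2:]]

w_parallel = data_parallel_step(0.0, shards, lr=0.01)
w_serial = 0.0 - 0.01 * gradient(0.0, data)
```

The two results agree because averaging shard gradients reproduces the full-batch gradient; the cost that data parallelism adds is the communication needed to combine them, which is why low-latency interconnects appear in the cluster designs above.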

Although GPUs are ideal for accelerating deep learning workloads, they suffer from high
power consumption. Power consumption is an important consideration when designing datacenter-scale
deep learning systems. In such scenarios, accelerating deep learning workloads via low-power
FPGAs or specialized ASICs becomes more attractive. Although FPGAs or ASICs cannot match
GPUs in absolute performance, they can provide comparable or even better performance per
watt (an Nvidia K40 GPU consumes around 235 W while an FPGA would consume around
25 W). Recent studies (Farabet et al. [2010], Ouyang et al. [2014], Ovtcharov et al. [2015],
Zhang et al. [2015]) have demonstrated the efficacy of using FPGAs for accelerating convolution
neural network computations. The FPGA implementation enables (1) support for multiple layer
configurations at run-time, (2) an on-chip communication network that minimizes memory traffic
to off-chip memory, and (3) a spatially distributed array of processing elements that can scale up
to thousands of units. In a typical FPGA implementation, input values are first loaded into a
multi-banked input buffer. These values are then streamed to the processing elements, which
independently perform the matrix computations in the convolution step. The results are then
accumulated and re-routed via a specialized network-on-chip back to the input buffers for the
next round of computations (Ovtcharov et al. [2015]). 
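
The performance-per-watt argument can be checked with a few lines of arithmetic using the two power figures quoted above. The 20% relative throughput used in the second half is a purely hypothetical value chosen for illustration; it is not a measurement from any of the cited studies.

```python
GPU_WATTS = 235.0    # Nvidia K40 figure quoted in the text
FPGA_WATTS = 25.0    # FPGA figure quoted in the text


def perf_per_watt(throughput, watts):
    return throughput / watts


# Fraction of the GPU's throughput at which the FPGA ties on performance per watt.
break_even_fraction = FPGA_WATTS / GPU_WATTS        # about 0.106

# Hypothetical case: the FPGA reaches 20% of the GPU's throughput
# (GPU throughput normalized to 1.0).
gpu_efficiency = perf_per_watt(1.0, GPU_WATTS)
fpga_efficiency = perf_per_watt(0.2, FPGA_WATTS)
advantage = fpga_efficiency / gpu_efficiency        # about 1.88
```

Under these figures, an FPGA needs only slightly more than a tenth of the GPU's throughput to match it in performance per watt, which is why lower absolute performance can still be the better trade-off at datacenter scale.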

Another area of active research is the development of specialized accelerators for neural
network workloads. Examples of such accelerators include DaDianNao (Chen et al. [2014]),
NeuFlow (Pham et al. [2012]), and IBM TrueNorth (Seo et al. [2011]). Both DaDianNao and
NeuFlow implement the dataflow versions of different stages of convolution neural networks (e.g.,
convolution and pooling) in hardware. The IBM TrueNorth, on the other hand, implements a
network of integrate-and-fire spiking neurons that have binary outputs. The chip has 4096
processors, each of which has 256 integrate-and-fire spiking neurons (with binary output), each with
256 inputs. Each processor can compute all 256 neurons 1000 times per second asynchronously,
with a power consumption of around 100 mW. Unlike DaDianNao and NeuFlow, TrueNorth
is designed only for classification, not training. 
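
For readers unfamiliar with the term, the sketch below shows a generic integrate-and-fire neuron with binary inputs and a binary output, together with the neuron and input counts implied by the figures above. The reset-to-zero rule, the integer weights, and the function name are assumptions of this sketch; it is not a description of TrueNorth's actual neuron model.

```python
def integrate_and_fire(spike_trains, weights, threshold):
    """Simulate one neuron; spike_trains holds one tuple of 0/1 inputs per time step."""
    potential = 0
    output = []
    for spikes in spike_trains:
        # Integrate: add the weight of every input that spiked in this step.
        potential += sum(w for w, s in zip(weights, spikes) if s)
        if potential >= threshold:
            output.append(1)   # fire a binary spike
            potential = 0      # reset after firing (an assumption of this sketch)
        else:
            output.append(0)
    return output


spikes_out = integrate_and_fire(
    spike_trains=[(1, 0, 0), (0, 1, 0), (1, 1, 1), (0, 0, 1)],
    weights=[1, 2, -1],
    threshold=3,
)

# Scale implied by the figures quoted in the text.
neurons_per_chip = 4096 * 256            # 1,048,576 neurons
inputs_per_chip = neurons_per_chip * 256  # 268,435,456 neuron inputs
```

Because inputs and outputs are single bits, each update needs only conditional additions and a comparison, which helps explain the very low power figure quoted for the chip.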


Further Reading 

Basic ideas: Bengio [2009], Schmidhuber [2014]; Software Infrastructures: cuDNN
(Chetlur et al. [2014]), Theano (Bastien et al. [2012], Bergstra et al. [2010]), Torch7
(Collobert et al. [2011]), and Caffe (Jia et al. [2014]); Commercial Solutions: Google Brain (Le
et al. [2012]), GoogLeNet (Szegedy et al. [2014]), Clarifai (Zeiler), Deep Speech, Google
Now and IBM Watson Speech Recognition (Dean [2015], Hannun et al. [2014], Hinton
et al. [2012], Ng [2015], van den Berg et al. [2015]); Systems: Google Brain (Dean [2015],
Ng [2015]), Project Adam (Chilimbi et al. [2014]), and Baidu Minwa (Wu et al. [2015]);
Specialized Hardware: FPGA (Farabet et al. [2010], Ouyang et al. [2014], Ovtcharov et al.
[2015], Zhang et al. [2015]), DaDianNao (Chen et al. [2012, 2014a,b], Liu et al. [2015]),
NeuFlow (Pham et al. [2012]), TrueNorth (Merolla et al. [2011], Seo et al. [2011]). 


4.3 COMPUTATIONAL FINANCE 


Computational (or quantitative) finance broadly covers computational techniques that address 
key problems in financial and insurance domains. These problems can be classified into (1) Sim- 
ulation, (2) Estimation, (3) Valuation, (4) Calibration, (5) Asset liability management, and (6) 
Risk Management (Forsyth [2014], Korn [2014]). Specific examples of these problems include 
simulation (pricing) of various financial instruments (e.g., stock prices, options, interest rates, 
commodity prices); valuation of derivatives; quantitative methods for high-frequency algorith- 
mic trading; measuring the risk of a whole bank or an instrument, and managing asset liabilities 
for life insurers or pension funds. 

One of the most common computation patterns in computational finance is the use of
stochastic differential equations (SDEs) for pricing. The most widely used model, the
Black-Scholes model, models the price variations (paths) of a stock over time as a geometric
Brownian motion using market parameters such as interest rates, drift, and volatility.
Unfortunately, the Black-Scholes model assumes constant volatility and hence does not provide an
accurate reflection of modern financial markets. The Heston model extends the Black-Scholes
model by using a second stochastic differential equation to capture stochastic volatility variations
(Delivirias [2012]). Solving these differential equations via finite difference methods is
computationally very expensive, and in many cases, such as pricing exotic options, it is not possible to
compute a closed form solution. In such cases, an approximate approach using Monte Carlo (MC)
simulation is employed. In the MC approach, the first step generates a set of random price paths,
the second step calculates the associated value for each path using a payoff function, and the third
step averages these values and discounts the average to a given day to compute the option value for
that day. The MC approach usually requires a pseudo-/quasi-random number generator to generate
data with a specific distribution (e.g., normal). Both SDE solving and MC approaches involve
significant floating point computations and are very computationally expensive. 
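
The three Monte Carlo steps can be sketched for the simplest case, a European call option under the Black-Scholes (geometric Brownian motion) model, where a closed-form price exists and can serve as a check. The function names, parameter values, and the use of Python's built-in generator are assumptions of this sketch; because the payoff depends only on the final price, it samples the terminal price directly rather than simulating full paths, which would not be sufficient for path-dependent exotic options.

```python
import math
import random


def mc_european_call(s0, strike, rate, vol, maturity, n_paths, seed=0):
    """Monte Carlo price of a European call under geometric Brownian motion."""
    rng = random.Random(seed)
    drift = (rate - 0.5 * vol * vol) * maturity
    diffusion = vol * math.sqrt(maturity)
    total = 0.0
    for _ in range(n_paths):
        z = rng.gauss(0.0, 1.0)
        s_t = s0 * math.exp(drift + diffusion * z)   # step 1: random terminal price
        total += max(s_t - strike, 0.0)              # step 2: payoff for this path
    # Step 3: average the payoffs and discount back to today.
    return math.exp(-rate * maturity) * total / n_paths


def black_scholes_call(s0, strike, rate, vol, maturity):
    """Closed-form Black-Scholes price, used here only as a reference value."""
    sqrt_t = math.sqrt(maturity)
    d1 = (math.log(s0 / strike) + (rate + 0.5 * vol * vol) * maturity) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t

    def cdf(v):
        return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

    return s0 * cdf(d1) - strike * math.exp(-rate * maturity) * cdf(d2)


mc_price = mc_european_call(100.0, 100.0, 0.05, 0.2, 1.0, n_paths=50_000)
exact_price = black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0)
```

Every path is independent of every other, so the loop can be split across cores, SIMD lanes, or GPU threads with no communication until the final average; the statistical error shrinks only with the square root of the number of paths, which is what makes the method so compute-hungry.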

The choice of accelerators for computational finance workloads depends not only on the 
computational complexities of the kernels, but also on the runtime and operational constraints. 
Running these workloads on servers or clusters of standard multi-core CPUs is not feasible either 
due to long running times or energy consumption (Christian de Schryver and Schmidt [2011], 
Wehn [2011]). Earlier, many of the modeling tasks were done in batch mode. But now, many 
of these models have to be executed in real-time. Further, many of the workloads, e.g., for algo- 
rithmic trading, require ultra-low latency, wire-speed end-to-end performance. In addition, there 
are additional reporting requirements that need to be met to satisfy various regulations. Given the
high computational requirements, computational finance workloads are ideal candidates for acceleration
using traditional compute accelerators such as GPUs and Intel Xeon Phi (Giles [2010], xcelerit
[2014]). However, given the high power consumption of these accelerators, FPGAs, with their
low-power footprint, are proving to be a popular choice for accelerating computational finance
workloads (Luk [2013]). 

The most common analytical model in computational finance is the Monte Carlo
simulation. The Monte Carlo approach is inherently embarrassingly parallel and can be easily
parallelized. However, Monte Carlo methods require a high-quality, cheap source of pseudo-random
numbers, particularly for key probability distributions: uniform, Gaussian, exponential, and
log-normal (Thomas et al. [2009]). In some cases, the Monte Carlo method needs low-discrepancy
quasi-random number generators such as Sobol sequences (Bordawekar and Beece [2015]). The
computational characteristics of these random number generators depend on many factors such
as period, statistical quality, and computational cost. Ideally, one would like to have a random
number generator with high statistical quality and a long period, but with low computational cost.
An approach that achieves such a balance is the binary linear generator, which performs binary linear
operations (logical conjunction or exclusive disjunction) on vectors of individual bits. Such
operations can be implemented using bit-wise operations such as masking, exclusive-or, and
shifting. Generation of non-uniform random numbers uses different approaches, e.g.,
inversion, transformation, rejection, and recursion (Thomas et al. [2009]). The first three generate
non-uniform random numbers by consuming uniform input and operating on it: inversion
applies the inverse Cumulative Distribution Function (CDF) to a uniform sample; transformation
maps a fixed set of uniform samples into a fixed set of non-uniform samples using transcendental
functions (e.g., the Box-Muller transform); and rejection selects only a few values from a larger set
of uniform numbers (e.g., the Ziggurat approach). The recursive approach directly generates
non-uniform samples, e.g., the Wallace method can generate Gaussian or exponential distributions
efficiently without using any transcendental functions. However, this approach has problems with
correlation between output sample values, hence it is not used widely. While both uniform and
non-uniform random number generators have been shown to be accelerated via CPUs, GPUs, and FPGAs
(Matsumoto and Nishimura [1998], Thomas et al. [2009]), FPGAs have clear advantages due to the
availability of very fine-grain binary linear operations (both CPUs and GPUs have word-based
instructions). Traditionally, CPUs have been shown to accelerate both uniform and non-uniform random
number generators, e.g., the Mersenne Twister uses SIMD instructions to achieve a higher generation
rate. While GPUs can effectively accelerate uniform random number generators, they suffer from
the cost of branching inherent in certain non-uniform random number generators, e.g., the
Ziggurat approach. 
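
The generator pipeline described above can be sketched end to end: a binary linear uniform generator built only from shifts and exclusive-ors, followed by the inversion and transformation methods for non-uniform samples. The uniform generator is Marsaglia's 32-bit xorshift, chosen here as a small, well-known binary linear generator rather than one named in the text; the helper names and sample counts are assumptions, and the rejection (Ziggurat) and recursive (Wallace) methods are not shown.

```python
import math

MASK32 = 0xFFFFFFFF


def xorshift32(state):
    """One step of a 32-bit xorshift generator; state must be non-zero."""
    state ^= (state << 13) & MASK32
    state ^= state >> 17
    state ^= (state << 5) & MASK32
    return state


def uniform_stream(seed, count):
    """Yield `count` uniform samples in (0, 1) from a non-zero 32-bit seed."""
    state = seed
    for _ in range(count):
        state = xorshift32(state)
        yield state / 2.0**32


def exponential_by_inversion(u, rate=1.0):
    """Inversion: apply the inverse CDF of the exponential distribution."""
    return -math.log(1.0 - u) / rate


def box_muller(u1, u2):
    """Transformation: two uniform samples become two Gaussian samples."""
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


uniforms = list(uniform_stream(seed=1, count=1000))
exponentials = [exponential_by_inversion(u) for u in uniforms]
gaussians = [z for i in range(0, 1000, 2)
             for z in box_muller(uniforms[i], uniforms[i + 1])]
```

The uniform step uses only bit-level operations, which is the part that maps naturally onto FPGA logic, while the logarithm, square root, and trigonometric calls in the non-uniform steps are the transcendental functions that the Wallace method is designed to avoid.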

Given the ability of FPGAs to effectively generate high-quality random numbers while
consuming low power, and their suitability for processing data in low-latency scenarios, FPGAs have
become the accelerator of choice in the financial industry. In practice, FPGAs have been used to
accelerate option and derivative pricing (xcelerit [2013b]), high-frequency trading (Leber et al.
[2011]), Collateralized Debt Obligation (CDO) pricing (Kaganov et al. [2011]), real-time risk
management (Studnitzer and Mencer [2013]), and Credit Value Adjustment (CVA) computations
(Kaganov et al. [2011]). Recently, GPUs (Giles [2010]) and other accelerators such as Intel
Xeon Phi (xcelerit [2014]) have been increasingly used in financial modeling applications.
Although the raw compute performance of such compute accelerators is better than that of FPGAs,
power consumption and the cost of invoking the accelerator functions still remain issues for GPUs. 


Further Reading 

Basic Idea: Christian de Schryver and Schmidt [2011], Forsyth [2014], Korn [2014],
Wehn [2011]; Algorithms: Heston Model (Delivirias [2012]), CVA (Kaganov et al. [2011]),
Pseudo-random number generators (Matsumoto and Nishimura [1998], Thomas et al.
[2009]), Sobol Sequences (Bordawekar and Beece [2015]), CDO (Kaganov et al. [2011]);
Accelerator Systems: SIMD exploitation (Matsumoto and Nishimura [1998]), FPGA
(de Schryver et al. [2011], Jin et al. [2011], Leber et al. [2011], Morris et al. [2009], Sadoghi
et al. [2010], Studnitzer and Mencer [2013], Wang et al. [2013], Weston et al. [2010, 2011],
xcelerit [2013b]), GPUs (Fricker [2015], Giles [2010], Grauer-Gray et al. [2013], Lahlou
[2013], Lotze et al. [2012], Papamanousakis et al. [2015], xcelerit [2013a]), and Intel Xeon
Phi (xcelerit [2014]). 
The most common OLAP scenario involves processing large amounts of disk-resident data
stored in relational tables using relational queries written in SQL (ROLAP). A related, but
less widely used, strategy involves viewing the disk-resident data as multi-dimensional arrays
(MOLAP). With the availability of cheap and reliable DRAM chips, main-memory databases are
becoming increasingly popular (Lahiri [2014]). While traditional database performance is
affected by disk I/O costs, the performance of main-memory and streaming databases is affected
by memory and networking costs.

In OLAP workloads, the opportunities for acceleration depend on a variety of factors,
including (1) the type of underlying logical model (i.e., relational, multi-dimensional, or
hybrid), (2) the execution scenario (e.g., disk-based, in-memory, or streaming), (3) system
implementation issues (e.g., row-based or columnar storage), and (4) the type of function to be
accelerated (e.g., query compilation, query execution, or utility functions such as sorting).

In practice, database systems use accelerators to reduce both I/O and compute costs.
For example, the IBM Netezza system uses a combination of FPGA-based acceleration and
customized software to optimize data-intensive mixed database and analytics workloads with
concurrent queries from thousands of users. The Netezza system uses two key principles to
improve performance: (1) reduce unnecessary data traffic by moving processing closer to the
data, and (2) use parallelization techniques to reduce query processing costs (Feldman [2013]).
A Netezza appliance is a distributed-memory system with a host server connected to a cluster
of independent servers called snippet blades (S-Blades). A Netezza host first compiles a query
using a cost-based query optimizer that uses the data and query statistics, along with disk,
processing, and networking costs, to generate plans that minimize disk I/O and data movement.
The query compiler generates executable code segments, called snippets, which are executed in
parallel by the S-Blades. Each S-Blade is a self-contained system with multiple multi-core CPUs,
FPGAs, gigabytes of memory, and a local disk subsystem. For a snippet, the S-Blade first reads
the data from disks into memory using a technique that reduces disk scans. The data streams are
then processed by FPGAs at wire speed. In a majority of cases, the FPGAs filter data from the
original stream using predicate evaluation, and only a tiny fraction is sent to the S-Blade CPUs
for further processing. The CPUs then execute either database operations such as sort, join, or
aggregation, or core mathematical kernels of analytics applications, on the filtered data
streams. Results from the snippet executions are then combined to compute the final result. The
Netezza system can operate on thousands of data streams in parallel. In addition to filtering,
Netezza also exploits FPGAs to accelerate key utility functions such as decompression.
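
The snippet flow described above (filter close to the data, aggregate on the CPU, then combine partial results) can be sketched in a few lines of Python. This is only an illustrative model under our own assumptions: the function names, the row format, and the sample data are hypothetical and do not correspond to any Netezza interface.

```python
from collections import defaultdict


def filter_stage(rows, predicate, columns):
    """Stand-in for the FPGA stage: drop rows that fail the predicate and
    keep only the requested columns, so less data reaches the CPU."""
    for row in rows:
        if predicate(row):
            yield tuple(row[c] for c in columns)


def aggregate_stage(filtered_rows):
    """Stand-in for the CPU stage: group by region and sum the amounts."""
    totals = defaultdict(float)
    for region, amount in filtered_rows:
        totals[region] += amount
    return totals


def run_query(blades, predicate):
    """Run the same snippet on every blade, then combine partial results."""
    combined = defaultdict(float)
    for rows in blades:  # each blade would process its own data slice in parallel
        partial = aggregate_stage(filter_stage(rows, predicate, ("region", "amount")))
        for region, subtotal in partial.items():
            combined[region] += subtotal
    return dict(combined)


# Two hypothetical blades, each holding a slice of the same table.
blades = [
    [
        {"region": "east", "year": 2014, "amount": 10.0},
        {"region": "west", "year": 2013, "amount": 5.0},
        {"region": "east", "year": 2014, "amount": 2.5},
    ],
    [
        {"region": "west", "year": 2014, "amount": 7.0},
        {"region": "east", "year": 2012, "amount": 100.0},
    ],
]
result = run_query(blades, lambda row: row["year"] == 2014)
```

Only rows that pass the predicate are handed to the aggregation stage, which mirrors the principle of reducing data traffic before the more expensive operators run.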

When the data is entirely in memory, the overall execution becomes memory-bound. To
improve memory performance, databases use optimization techniques such as specialized memory
layouts, e.g., columnar storage (Manegold et al. [2000]), and operating directly on
compressed data (Raman et al. [2013]). These techniques improve memory access locality
and enable compute acceleration using SIMD instruction sets such as x86 AVX2 and Power VSX.
For in-memory databases, SIMD instructions have been shown to accelerate various key operations
in query execution, e.g., computing set intersections, predicate evaluations, and aggregation
calculations (Lahiri [2014], Raman et al. [2013], Schlegel et al. [2013], Sikka et al. [2013],
Willhalm et al. [2009b], Zhou and Ross [2002]). SIMD instructions are also applied to accelerate
key utilities such as sorting, compression/decompression, and string processing (Inoue et al.
[2007]). A special case of main-memory databases is the streaming database, which operates on
(potentially infinite) streams of data using a pre-defined set of queries (non-streaming
databases support ad-hoc queries as well). Many of these queries perform filtering or matching
operations that are amenable to SIMD exploitation (Gedik et al. [2008], Wang et al. [2010]).
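
To make the link between columnar layout and predicate evaluation concrete, the following sketch evaluates two range predicates column by column and combines them through selection bitmaps. It is a scalar Python model, not SIMD code: the point is that each loop touches one contiguous, uniformly typed column, which is the access pattern that SIMD units can process several values at a time. The column names and values are hypothetical.

```python
from array import array

# Column-store layout: one contiguous, homogeneously typed array per column.
price = array("i", [5, 12, 7, 30, 18, 3])
qty = array("i", [2, 1, 9, 4, 6, 8])


def eval_predicate(column, low, high):
    """Return a selection bitmap (one 0/1 byte per row) for low <= value < high."""
    return bytearray(1 if low <= v < high else 0 for v in column)


def and_bitmaps(a, b):
    """Combine two selection bitmaps; a row survives only if both predicates hold."""
    return bytearray(x & y for x, y in zip(a, b))


def sum_selected(column, bitmap):
    """Aggregate a column over the rows marked in the bitmap."""
    return sum(v for v, keep in zip(column, bitmap) if keep)


# WHERE price >= 5 AND price < 20 AND qty >= 4 AND qty < 10
selection = and_bitmaps(eval_predicate(price, 5, 20), eval_predicate(qty, 4, 10))
total_price = sum_selected(price, selection)
```

Because each predicate is evaluated over a whole column before the results are combined, the inner loops contain no per-row dispatch on column type, which is what makes them candidates for vectorization.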

For in-memory MOLAP systems, aggregation over large datasets is often the performance
bottleneck. In MOLAP systems, data is viewed logically as a multi-dimensional cube and
processed using operators that compute on regions built as collections of cells. MOLAP data is
usually sparse and stored using specialized sparse data structures. Aggregating over such sparse
datasets often results in accessing non-contiguous (strided) data, which makes efficient
exploitation of SIMD instructions difficult. GPUs, on the other hand, can use their massive
data-parallelism capabilities and high memory bandwidth to parallelize aggregation of strided
data. An example of this approach is the Jedox Palo in-memory MOLAP engine (Strohm [2015]),
which uses GPUs for aggregation over large strided datasets. Recently, similar GPU techniques
have been explored for accelerating OLAP queries on relational data (Wu et al. [2014b]).
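
The aggregation pattern over a sparse cube can be sketched as follows. We assume, purely for illustration, a cube stored as a dictionary from cell coordinates to values; real MOLAP engines use more compact sparse structures, and the dimension names here are hypothetical.

```python
from collections import defaultdict

# Sparse cube: only non-empty cells are stored, keyed by (product, region, month).
cube = {
    (0, 0, 0): 10.0,
    (0, 2, 1): 4.0,
    (1, 0, 0): 6.0,
    (1, 1, 3): 1.5,
    (2, 2, 1): 8.0,
}


def roll_up(cells, keep_dims):
    """Aggregate away every dimension whose index is not listed in keep_dims."""
    out = defaultdict(float)
    for coords, value in cells.items():
        key = tuple(coords[d] for d in keep_dims)
        out[key] += value  # scattered update: the target depends on the coordinates
    return dict(out)


by_region = roll_up(cube, keep_dims=(1,))
by_product_month = roll_up(cube, keep_dims=(0, 2))
```

Every stored cell contributes to an output location determined by its coordinates, so reads and updates are scattered rather than contiguous. The per-cell work is independent, which is why such roll-ups map naturally onto many parallel threads.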
Recently, there has been renewed interest in exploring specialized hardware acceleration
support for database operations. Key examples of such approaches include Sonoma (Vinaik and
Puri [2015]), HASHI (Arnold et al. [2014]), and Q100 (Wu et al. [2014c]). The Oracle SPARC
Sonoma processor incorporates a novel Database Accelerator (DAX) that operates on in-memory
decompressed and compressed columnar vectors in a streaming manner. The DAX architecture also
includes new SIMD instructions designed to accelerate core database operations. The HASHI
approach describes new instructions for speeding up 32-bit hash functions for integer and
string keys. The Q100 proposal describes the architecture of a DataBase Processing Unit (DPU),
built as a collection of heterogeneous ASIC tiles that process relational tables quickly
and with low power.
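
To show the kind of computation that hash-oriented instructions target, the sketch below implements the well-known 32-bit FNV-1a hash for string and integer keys. FNV-1a is our own choice of example; we do not claim that it is the function used by HASHI. The helper names are hypothetical.

```python
FNV_OFFSET_BASIS = 0x811C9DC5  # standard 32-bit FNV offset basis
FNV_PRIME = 0x01000193         # standard 32-bit FNV prime
MASK32 = 0xFFFFFFFF


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR in each byte, then multiply by the FNV prime."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & MASK32  # keep the state within 32 bits
    return h


def hash_string_key(key: str) -> int:
    """Hash a string key through its UTF-8 bytes."""
    return fnv1a_32(key.encode("utf-8"))


def hash_int_key(key: int) -> int:
    """Hash an unsigned 32-bit integer key through its 4 little-endian bytes."""
    return fnv1a_32(key.to_bytes(4, "little", signed=False))


def bucket_of(hash_value: int, num_buckets: int) -> int:
    """Map a hash value to a bucket of a hash table with num_buckets slots."""
    return hash_value % num_buckets
```

The loop performs one XOR and one multiply per input byte, with each step depending on the previous one. Such short serial dependency chains, executed for every key of a join or aggregation, are what make hashing a natural candidate for dedicated instructions.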


Further Reading 

Algorithms and Software Solutions: Exploiting SIMD for accelerating database kernels
(Schlegel et al. [2013], Willhalm et al. [2009b], Zhou and Ross [2002]); Systems: Netezza
(Feldman [2013]), Oracle (Lahiri [2014]), SAP Hana (Sikka et al. [2013]), DB2 BLU (Raman
et al. [2013]), MonetDB (Manegold et al. [2000]), InfoSphere Streams (Gedik et al.
[2008]); Specialized Hardware: Oracle Sonoma (Hetherington [2015], Vinaik and Puri
[2015]), Q100 (Wu et al. [2014c]), HASHI (Arnold et al. [2014]).
Similar to text analytics, graph analytics workloads can also be partitioned into two classes
based on their computational patterns. The first approach navigates the graph explicitly and is
used for pattern matching and traversal algorithms. The second approach uses sparse
matrix-based linear algebraic formulations for solving structural graph analytics problems. Both
approaches result in a large number of small, high-latency, non-contiguous memory accesses,
making graph analytics a memory-bound problem with very limited spatial and temporal memory
locality. One way of accelerating memory-bound problems is to use processors that support
massive multi-threading with high memory bandwidth and can thus effectively hide or tolerate
memory latencies by overlapping the computation and memory accesses of multiple threads.

Examples of such processors include the Cray Threadstorm processor and GPUs. The Cray
Threadstorm processor is a massively multi-threaded processor that is used to power the Cray
XMT system (Kopser and Vollrath [2012]). The Cray XMT is a distributed shared memory system
built as a cluster of 4 Threadstorm processors, connected to a very high-speed interconnect
organized as a torus. Each cluster node can have up to 64 GB of memory, and the overall system
can be scaled to multiple terabytes of main memory. From a user's perspective, the XMT appears
as a single processor with a large number of threads operating in a shared address space. On the
Cray XMT, threads are lightweight software objects that are mapped onto hardware streams.
A stream has its own very small register state and executes its instructions independently.
Typically, the compiler generates many more threads than the number of streams in the machine,
and these threads are then multiplexed onto the hardware streams. The small per-thread state
allows lightweight context switching at every instruction cycle. The uRiKA graph analytics
system is built using the Cray XMT infrastructure (Maltby [2012]). The uRiKA system is designed
specifically to accelerate queries on in-memory RDF (Resource Description Framework) databases.
RDF is a W3C standard designed to enable semantic web searching and integration of disparate
data sources. The Semantic Web is a graph that captures relationships between entities using
triplets (subject, predicate, object), where each triplet is essentially an edge (predicate)
connecting two nodes (subject and object). The RDF representation is used extensively to
represent knowledge graphs, e.g., biological network graphs. RDF databases are queried using a
specialized query language called SPARQL, which enables matching of graph patterns in the RDF
databases. On the uRiKA system, SPARQL queries are parallelized by the compiler into multiple
small queries, which are then executed by the underlying Cray XMT system. The uRiKA system has
been used extensively to accelerate semantic web workloads on very large graphs, e.g.,
large-scale data mining of pharmacological, chemical, and biological semantic graphs (Henschel
et al. [2014]).
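
The triplet model and graph pattern matching described above can be sketched with a toy in-memory triple store. This is a simplified illustration under our own assumptions: terms starting with "?" play the role of query variables, the matching is a naive nested loop, and the data and function names are hypothetical. It is not SPARQL and does not reflect how uRiKA executes queries.

```python
# Each triple is an edge: (subject) --predicate--> (object).
triples = [
    ("aspirin", "inhibits", "COX1"),
    ("aspirin", "inhibits", "COX2"),
    ("ibuprofen", "inhibits", "COX2"),
    ("COX2", "involved_in", "inflammation"),
    ("COX1", "involved_in", "platelet_aggregation"),
]


def is_variable(term):
    """Terms that start with '?' are variables to be bound by the match."""
    return term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend binding so that pattern matches triple, or return None."""
    extended = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_variable(p_term):
            if extended.setdefault(p_term, t_term) != t_term:
                return None  # variable already bound to a different value
        elif p_term != t_term:
            return None      # constant term does not match
    return extended


def query(patterns, store):
    """Match a list of triple patterns that share variables (a graph pattern)."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in store:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


# Which drugs inhibit a target that is involved in inflammation?
answers = query(
    [("?drug", "inhibits", "?target"), ("?target", "involved_in", "inflammation")],
    triples,
)
drugs = sorted(b["?drug"] for b in answers)
```

Each candidate binding can be checked against the store independently of the others, which gives an intuition for why such queries can be split into many small units of work for a massively multi-threaded machine.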

The Single-Instruction Multiple-Thread (SIMT) execution model of Nvidia GPUs is
very similar to the multi-threaded model supported by the Cray Threadstorm processor. The
current versions of Nvidia GPUs exhibit memory bandwidth over 280 GB/s and can support millions
of threads. Over the years, several traversal-based graph analytics algorithms have been ported
to Nvidia GPUs, e.g., Breadth-First Search, Single-Source Shortest Path, All-Pairs Shortest
Path, and Minimum Spanning Tree (Harish and Narayanan [2007], McLaughlin and Bader
[2014], Merrill et al. [2012]). There are multiple graph analytics libraries that provide
efficient implementations of these and other graph analytics algorithms (e.g., Gunrock and
MapGraph; Fu et al. [2014], Wang et al. [2015]). These implementations represent graphs using
array-based data structures such as adjacency lists, v-graph (vector-graph; Blelloch [1990]), or
structure of arrays, and use either the BSP (Bulk Synchronous Parallel) or GAS
(Gather-Apply-Scatter; He et al. [2007]) approaches to navigate the graphs. Although these
implementations suffer from a lack of spatial and temporal memory locality, they are aided by
many of the GPU's architectural features such as massive multi-threading via SIMT, effective
thread scheduling, large register files and shared memory, and texture memory/read-only caches.
In particular, the GPU's texture memory provides hardware support for improving the performance
of non-contiguous memory accesses. Unfortunately, all the current implementations work only on
graph datasets that fit in the GPU's device memory. Development of scalable multi-GPU
out-of-core graph traversal algorithms is not trivial, and is an area of active research (Wang
and Owens [2013]).
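
A level-synchronous Breadth-First Search over an array-based graph representation can be sketched as follows. We use the compressed sparse row (CSR) layout as the array-based structure; this choice, the function names, and the sample graph are our own assumptions, and the code is a sequential model rather than a GPU implementation.

```python
def build_csr(num_vertices, edges):
    """Build CSR arrays (row_offsets, column_indices) for a directed graph."""
    degree = [0] * num_vertices
    for src, _ in edges:
        degree[src] += 1
    row_offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        row_offsets[v + 1] = row_offsets[v] + degree[v]
    column_indices = [0] * len(edges)
    cursor = row_offsets[:-1]  # next free slot per vertex (the slice is a copy)
    for src, dst in edges:
        column_indices[cursor[src]] = dst
        cursor[src] += 1
    return row_offsets, column_indices


def bfs_levels(row_offsets, column_indices, source):
    """Return the BFS depth of every vertex (-1 if unreachable)."""
    depth = [-1] * (len(row_offsets) - 1)
    depth[source] = 0
    frontier = [source]
    level = 0
    while frontier:  # one superstep per BFS level
        next_frontier = []
        for v in frontier:  # on a GPU, frontier vertices would be handled in parallel
            for e in range(row_offsets[v], row_offsets[v + 1]):
                w = column_indices[e]  # data-dependent, non-contiguous access
                if depth[w] == -1:
                    depth[w] = level + 1
                    next_frontier.append(w)
        frontier = next_frontier
        level += 1
    return depth


edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
row_offsets, column_indices = build_csr(6, edges)  # vertex 5 has no edges
depths = bfs_levels(row_offsets, column_indices, source=0)
```

The reads of `depth[w]` jump around memory according to the edge list, which is the source of the poor locality noted above. All vertices of a frontier are processed before the next level starts, which corresponds to the superstep structure of the bulk synchronous model.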

The second approach for implementing graph analytics involves operating on sparse matrix
representations of the input graphs. The linear algebraic approach is suitable for computing
various structural properties of a graph, e.g., centrality measures, which use algorithms that
compute eigenvectors of the graph matrix (e.g., PageRank; Bryan and Leise [2006], Mahoney).
The key kernels used by this approach include sparse matrix-vector multiplication (SpMV) and
sparse matrix-dense matrix multiplication (csrmm). Efficient implementations of these and other
related kernels are available on both CPU and GPU platforms (e.g., the cuSPARSE, Intel MKL,
and IBM ESSL libraries). As sparse matrix computations are also memory bound, the performance
of these kernels on GPUs is significantly better than on the current generation of multi-core
CPUs (Yang et al. [2011]). Recently, a specialized graph processor architecture (Song et al.
[2013]) was proposed to address the key weaknesses of traditional CPU designs. The proposed
graph processor is a specialized parallel processor that uses a new instruction set optimized
for sparse matrix operations. The graph processor is built as an array of sparse matrix
processors (called node processors) connected via a high-bandwidth three-dimensional
communication fabric. Large sparse matrices are distributed over these node processors and
operated on using the new instruction set. Each node processor has cache-less local memory. All
data computations, index-related computations, and memory operations are handled by specialized
accelerator modules rather than by the central processing unit. The processors use new
message-routing algorithms that are optimized for communicating very small packets of data such
as sparse matrix elements or partial products. While current performance estimates are based on
simulation, the specialized graph processor direction is very promising and deserves further
investigation.
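
The role of SpMV in the linear algebraic approach can be illustrated with a small PageRank power iteration over a matrix stored in CSR form. This is a minimal sketch under our own assumptions: the damping factor of 0.85, the fixed iteration count, and the four-edge example graph (which has no dangling vertices) are illustrative choices, and the function names are hypothetical.

```python
def spmv(row_offsets, col_indices, values, x):
    """Compute y = A * x for a sparse matrix A stored in CSR form."""
    y = [0.0] * (len(row_offsets) - 1)
    for i in range(len(y)):
        acc = 0.0
        for k in range(row_offsets[i], row_offsets[i + 1]):
            acc += values[k] * x[col_indices[k]]  # indirect read of x
        y[i] = acc
    return y


def pagerank(row_offsets, col_indices, values, damping=0.85, iterations=100):
    """Power iteration on a column-stochastic link matrix in CSR form."""
    n = len(row_offsets) - 1
    rank = [1.0 / n] * n
    for _ in range(iterations):
        spread = spmv(row_offsets, col_indices, values, rank)
        rank = [(1.0 - damping) / n + damping * s for s in spread]
    return rank


# Graph edges: 0->1, 0->2, 1->2, 2->0.
# Entry A[i][j] = 1 / outdegree(j) if there is an edge j->i.
row_offsets = [0, 1, 2, 4]
col_indices = [2, 0, 0, 1]
values = [1.0, 0.5, 0.5, 1.0]
ranks = pagerank(row_offsets, col_indices, values)
```

Almost all of the work is in `spmv`, whose reads of `x` are driven by the column indices and are therefore non-contiguous. In this example vertex 2 receives links from both other vertices and ends up with the highest rank.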


Further Reading 

Algorithms and Software Solutions: Graph algorithms on GPUs (Harish and Narayanan
[2007], Harish et al. [2009], McLaughlin and Bader [2014], Merrill et al. [2012], Yang et al.
[2011]), PageRank (Bryan and Leise [2006]), Gunrock (Wang et al. [2015]), MapGraph
(Fu et al. [2014]), v-graph (Blelloch [1990]), GAS (He et al. [2007]); Systems: Cray XMT
(Kopser and Vollrath [2012]), uRiKA (Henschel et al. [2014], Maltby [2012]); Specialized
Hardware: Song et al. [2013].
CHAPTER 5


Architectural Desiderata for 
Analytics 


In the previous chapters, we analyzed the behavior of existing analytics workloads, examined
their computational and runtime patterns using analytics exemplars, and discussed various
acceleration opportunities. In this chapter, we discuss future trends in analytics workloads and
their impact on designing future processors and systems specifically targeted at analytics.

Within the span of a few years, analytics applications have moved from being tools of
convenience to being essential tools. Advances in systems, software, and hardware have
drastically changed the usage and types of analytics workloads, as outlined here.


• Widespread availability of cheap and reliable internet services has enabled low-power
mobile devices to become the dominant platform for consuming analytics applications and
for generating data for analytics workloads. With the emergence of the Internet of Things,
e.g., home thermostats and wearable devices, this trend is going to continue in the
foreseeable future. Further, analytics on mobile devices will become more commonplace
in enterprise scenarios (e.g., point-of-sale devices).


• Data being consumed or generated by the analytics workloads is accessed primarily from 
virtualized cloud resources. Cloud enables scalable user access to ever-increasing data repos- 
itories. While the cloud infrastructure has enabled easy sharing of data for analytics work- 
loads, it has created serious issues with data privacy and security. 


• Advances in software-hardware co-design have enabled complex, resource-intensive
multi-modal applications to become ubiquitous, e.g., Intelligent Personal Assistants (IPAs)
that use inputs such as voice, vision, and contextual (e.g., spatial coordinates) information
to provide answers in natural languages (Hauswald et al. [2015]). As a consequence, new
domains have opened up for exploiting analytics, e.g., personalized healthcare, or home and
car automation.


• Convergence of mobile and social domains has led to increasing use of spatial and temporal
analytics workloads that often require executing analytics queries in (near) real-time. For
example, one can imagine a trip scheduler that takes voice input and generates a personalized
travel itinerary in real-time based on specified constraints (e.g., price and time) and
historical travel data.


• Analytics workloads are increasingly using operations that overlap the traditional
analytics, high-performance computing, and data management boundaries. For example, current
state-of-the-art techniques for parallelizing the gradient descent algorithm borrow heavily
from parallel computing and distributed data management approaches (Niu et al. [2011]).


• The end-to-end workflow of an analytics workload exhibits components with different runtime
constraints: a front-end component executing on low-power devices generates streams of
requests, and back-end components process these requests either purely in memory or in an
out-of-core manner. Some of the components may execute on a shared-memory system, some on
a distributed cluster, and some in a cloud environment. Thus, the end-to-end workload would
use multiple acceleration strategies with different characteristics.


5.1 ACCELERATORS FOR ANALYTICS WORKLOADS 


To address the requirements discussed in the previous chapter, it is important to review
currently available accelerator options and whether they can satisfy the requirements of
current and future analytics workloads.

Broadly, analytics accelerators can be viewed as a hierarchy built using a set of foundational
accelerators. Foundational accelerators are designed to accelerate core execution functions
such as compute, memory, I/O, and networking. Examples of foundational accelerators include
compute accelerators such as SIMD engines, GPUs, and FPGAs; memory accelerators such as GPU
texture memory or Micron's Active memory (Kirsch [2003]); networking accelerators such as
network processors or RDMA-based accelerators (Lu et al. [2014]); and storage accelerators such
as active storage using non-volatile memory devices (Fitch [2013]). The foundational
accelerators can be used to build higher-order accelerators, namely functional, data-structure,
kernel, and workload accelerators (Table 5.1). These accelerators can also be characterized by
their location and execution patterns. An accelerator can be co-located on the same die as the
host processor (e.g., SIMD), connected to the processor via an external interface such as PCI-E,
connected directly to the memory subsystem (e.g., Micron's Yukon; Kirsch [2003]), connected
directly to the network (e.g., using network processors), or connected to storage subsystems
(e.g., BlueGene Active Storage; Fitch [2013]).


Table 5.1: Classification of analytics accelerators

Accelerator                           | Type           | Implementation | Location
--------------------------------------+----------------+----------------+--------------------------------------
Pattern/Regular Expression Matching,  | Functional     | CPU, GPU, FPGA | On/Off-chip, On-network,
Compress and Decompress,              |                |                | Near-storage, Near-memory
Encryption and Decryption             |                |                |
Bit-vector Processing                 | Functional     | CPU, FPGA      | On/Off-chip
Distance Metric                       | Functional     | CPU, GPU       | On/Off-chip, Near-memory
Streaming Processing                  | Functional     | CPU, FPGA      | On/Off-chip, On-network
Kernel Functions                      | Functional     | CPU, GPU       | On/Off-chip, Near-memory
Hash Tables,                          | Data-structure | CPU, GPU, FPGA | Near-memory, Near-storage
Bloom Filters,                        |                |                |
K-d, R, Binary,                       |                |                |
Prefix/Suffix Trees,                  |                |                |
Key-value Pairs,                      |                |                |
Index (B-Tree, Inverse),              |                |                |
Dense and Sparse Matrices             |                |                |
FFT and Convolution                   | Kernel         | GPU, FPGA      | On/Off-chip, Near-memory,
                                      |                |                | Near-storage
Sorting                               | Kernel         | CPU, GPU, FPGA | Near-memory, Near-storage
Matrix Computations                   | Kernel         | CPU, GPU       | Near-memory, Near-storage
Near-neighbor Search,                 | Kernel         | CPU, GPU, FPGA | On/Off-chip
Random Number                         |                |                |
Top-K Processing                      | Kernel         | CPU, GPU       | Near-memory, Near-storage,
                                      |                |                | On-network
Visualization                         | Workload       | GPU            | On/Off-chip, Near-memory,
                                      |                |                | Near-storage, On-network
Graph Traversal                       | Workload       | GPU, FPGA      | Near-memory, Near-storage
Security                              | Workload       | FPGA           | Near-memory, Near-storage
OLAP                                  | Workload       | CPU, GPU, FPGA | Near-memory, Near-storage
Neural Networks                       | Workload       | GPU            | Near-memory
Financial                             | Workload       | GPU, FPGA      | On/Off-chip, On-network
Bio-informatics                       | Workload       | GPU, FPGA      | Near-memory, Near-storage


• Functional Accelerators: Functional accelerators accelerate specific operations or functions.
As observed in Table 3.2, the analytics exemplars have several common functions that can
be accelerated, for example, regular expression evaluation for pattern matching (Tutomu
Murase and Kuriyama [2000]), compression/decompression, and encryption/decryption.
Table 5.1 presents a list of functional accelerators and how they can be implemented. A
number of functional accelerators can be implemented via foundational accelerators (e.g.,
support for cryptography in Intel's AVX2 instruction set). However, functions such as
distance computations (e.g., root-mean-square error between two vectors) and kernel functions
such as sigmoid functions are not supported in hardware.


• Data-structure Accelerators: Foundational accelerators can also be used to improve the
performance of operations on key data structures such as hash tables, dense and sparse
matrices, bloom filters, and a variety of trees, including B-trees, K-d trees, and R-trees.
Examples of non-matrix data structure accelerators include hash table acceleration on GPUs
(Alcantara [2011]), bloom filter acceleration on FPGAs (Dharmapurikar et al. [2004]), and K-d
trees on GPUs (Foley and Sugerman [2005]). There is excellent support for accelerating various
computations on dense and sparse matrices on GPUs: sparse matrix computations are
memory bound and can make extensive use of texture memory support on GPUs. Recent
studies have investigated alternative strategies for accelerating sparse matrix accesses, e.g.,
using FPGAs (Fowers et al. [2014]) or via a specialized architecture (Song et al. [2013]).


• Kernel Accelerators: The functional and data-structure accelerators can be used to
accelerate individual kernels. Examples of kernels that can be accelerated include: (1)
sorting, whose comparator and exchange functions can be accelerated by functional accelerators,
e.g., using SIMD instructions or via GPUs (Inoue et al. [2007], Merrill and Grimshaw
[2011]), (2) various matrix computations, e.g., BLAS and sparse matrix functions, (3) FFT
and convolution, (4) random number generators, and (5) analytical kernels such as k-Means
clustering. Most current CPUs and GPUs are supported by highly tuned libraries that implement
many of these kernels (e.g., Intel MKL, IBM ESSL, and Nvidia's cuBLAS, cuSPARSE, and
cuRAND libraries).


• Workload Accelerators: The final type of accelerator accelerates the execution of entire
workloads using functional, data-structure, and kernel accelerators. Examples of workload
accelerators include visualization, XML processing, security, financial, neural networks,
OLAP, and bio-informatics. Such workload accelerators can be built using either FPGAs or GPUs
(Putnam et al. [2014], Wu et al. [2014a]). As we discussed in Chapter 4, there has also been
significant interest in developing specialized hardware implementations of key workloads,
e.g., neural network-based computations (the NeuFlow and TrueNorth processors; Pham et al.
[2012], Seo et al. [2011]), OLAP (Wu et al. [2014c]), and XML processing (Mitra et al.
[2009b]).
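
The layering of accelerator types can be made concrete with a small sketch in which a distance function (a functional-level operation) is the building block of a k-Means iteration (a kernel-level operation). The code is plain Python and only models the structure: in an accelerated system the distance function is the piece that would be mapped onto SIMD units, a GPU, or an FPGA. All names and the sample data are hypothetical.

```python
def squared_distance(a, b):
    """Functional level: squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_clusters(points, centroids, distance=squared_distance):
    """Kernel level: label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: distance(p, centroids[c]))
        for p in points
    ]


def update_centroids(points, labels, centroids):
    """Kernel level: move each centroid to the mean of its assigned points."""
    updated = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            updated.append(old)  # keep an empty cluster's centroid unchanged
            continue
        dims = range(len(old))
        updated.append(tuple(sum(m[d] for m in members) / len(members) for d in dims))
    return updated


def kmeans(points, centroids, iterations=10):
    """Alternate assignment and update steps for a fixed number of iterations."""
    labels = assign_clusters(points, centroids)
    for _ in range(iterations):
        centroids = update_centroids(points, labels, centroids)
        labels = assign_clusters(points, centroids)
    return labels, centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, centroids=[(0, 0), (10, 10)])
```

Because `assign_clusters` takes the distance function as a parameter, a faster implementation of that one function speeds up the whole kernel without changing its structure, which is the sense in which higher-order accelerators are built from lower-level ones.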


It is clear that while existing processor architectures and systems are able to execute a few
analytics workloads, they still lack the capabilities to effectively accelerate certain key
analytics computational patterns (e.g., sparse matrix computations) and to support additional
requirements of the upcoming analytics workloads. Specifically, we identify the following
requirements.


• Supporting efficient non-contiguous memory accesses: The most common computational
pattern observed across multiple analytics workloads is the non-contiguous memory access
pattern caused by computations on sparse matrices, high-dimensional sparse data, or graphs.
Current CPU memories assume a linearized layout, which leads to inefficient memory
behavior. Specialized memory designs such as the texture memory supported by GPUs provide
two-dimensional memory accesses, but cannot sufficiently address the random access
requirements of graph analysis workloads. Recently, there has been some interest in designing
processors specifically for graph analytics workloads, e.g., the Graph Processor Architecture
from Song et al. [2013]. However, traditional CPUs still lack the capabilities to efficiently
support non-contiguous memory accesses.


• Supporting big data computations: In practice, analytics workloads operate on a wide
variety of large datasets, both persistent and streaming. As we observed, a majority of
analytics workloads extract useful information from the input data and compute results that
are much smaller than the input (the exceptions being OLAP and association rule mining). The
amount of computation is at least O(N), which means that big data often translates to big
compute. Traditional data-intensive algorithms were designed to minimize the number of
I/O accesses, but not to optimize the amount of data to be moved. However, with petabyte and
exabyte data sizes, the cost of moving data to compute cores has become dominant. One way
to address this problem is to bring compute functions closer to the data sources (e.g., disks,
network streams). Such active computations can not only reduce the number of I/O accesses, but
can also reduce the total amount of data moved (Acharya et al. [1998], Riedel et al. [1998],
Uysal et al. [2000]). Recent studies (Fitch [2013], Kirsch [2003]) have explored opportunities
for active storage in the context of DRAMs and non-volatile memories. However, far
more work is required for building a usable ecosystem for active computing.


• Supporting approximate computations: Approximate computing is an emerging paradigm
that enables building efficient hardware and software implementations by exploiting the
implicit resilience of applications toward inexactness of their underlying computations
(Chippa et al. [2013], Nair [2014]). As we have observed in the previous sections, for a
wide variety of analytics models, specifically in the data mining and machine learning domains
(e.g., neural networks, nearest-neighbor search, clustering, and support vector machines),
relative properties of intermediate or final computations are more important than absolute
values. In such cases, computations done in an approximate manner or using lower-precision
arithmetic have little or no impact on the output results, as long as they preserve the
relative ordering (Gupta and Gopalakrishnan [2014], Gupta et al. [2015], Mishra
et al. [2014]). Approximate implementations of algorithms can reduce the amount of
computation and improve memory performance by reducing memory footprint and improving
memory utilization. These improvements can lead to significant performance gains
and lower energy consumption.


• Support for multi-tenant execution: A consequence of executing in a cloud environment is
that the underlying system (computing, networking, and I/O) resources are used for multiple
types of workloads (e.g., analytics, data management, and high-performance computing).
To effectively serve these different workloads simultaneously, the underlying system
resources need to be shared fairly. For analytics workloads, where accelerators are being
increasingly used for improving the performance of specific kernels, virtualization brings
additional challenges. At present, support for virtualizing acceleration resources is still
primitive.
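
The observation that relative ordering often matters more than absolute values can be illustrated with a nearest-neighbor ranking computed twice: once on the original floating-point vectors and once on coarsely quantized integer versions. Quantization here is our own stand-in for lower-precision arithmetic; the scale factor, the data, and the function names are hypothetical, and the ordering is preserved only because the distances in this example are well separated.

```python
def quantize(vector, scale):
    """Map each value to a small integer (a stand-in for low-precision storage)."""
    return [round(v / scale) for v in vector]


def squared_distance(a, b):
    """Squared Euclidean distance; works for both float and integer vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rank_neighbors(query, points):
    """Return point indices ordered from nearest to farthest from the query."""
    return sorted(range(len(points)), key=lambda i: squared_distance(query, points[i]))


points = [[0.11, 0.93], [2.02, 1.96], [5.10, 4.87], [9.95, 0.12]]
query = [0.0, 0.0]

exact_order = rank_neighbors(query, points)

scale = 0.1  # quantization step: coordinates are kept to one decimal digit
approx_order = rank_neighbors(quantize(query, scale), [quantize(p, scale) for p in points])
```

The quantized distances differ from the exact ones, yet both rankings agree. When candidate distances are very close to each other the rankings can diverge, so the acceptable precision depends on the data and on how much inexactness the application tolerates.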
5.2 BRINGING IT ALL TOGETHER: BUILDING AN
ANALYTICS SYSTEM


So far we have discussed the computational and runtime characteristics of different analytics
workloads, and presented different strategies for accelerating these workloads. The key
question that is still unanswered is how one can integrate these acceleration solutions into a
single analytics system.

As we have observed so far, a majority of analytics workloads exhibit characteristics similar
to classical high-performance computing (HPC) workloads and traditional data management
systems. Figure 5.1 presents a view of how models from these three domains interact with each
other. Both HPC and analytics workloads use mathematical formulations to solve the problem
at hand (specifically, both approaches make extensive use of linear algebraic kernels). Some of
the analytics models, e.g., modeling and graph analytics, are widely used in the HPC context.
Further, HPC software infrastructure can be used for implementing scalable analytics
algorithms, e.g., using MPI as a communication layer. However, analytics and HPC workloads
differ in a key aspect: a majority of HPC applications have a single domain-specific focus
(e.g., seismic processing, cosmological simulations, or computational fluid dynamics). Such HPC
applications can be viewed as information extraction processes that have a single workflow
designed to address domain-specific functional and runtime goals. The application domain
determines the runtime characteristics, e.g., input/output data representations, data layout,
and sizes. HPC applications also exhibit compute-intensive behavior for in-memory data. Unlike
HPC applications, most analytics applications have a multi-domain focus and support several
independent workflows with potentially different domain-specific functional and runtime goals.
Each workflow could have different computational and I/O characteristics. Thus, an analytics
application can be viewed as an information integration process. Comparable similarities and
differences exist between data management and analytics workloads. Both analytics and data
management share the on-line analytical processing (OLAP) model; unstructured and
semi-structured data processing (e.g., XML, RDF, or natural text) shares algorithms with text
and graph analytics models; and data management over streaming data has similarities with
time-series processing. Also, many analytical workloads use data stored in relational databases
as the primary source of input. The main difference between data management and analytics
workloads is transactional processing, which involves management of concurrent update
operations. By default, analytics workloads operate on read-only data. Unlike analytics
workloads, the performance of traditional data management systems is always affected by data
access costs (from disks, memory, or network). In addition, transactional management systems
often use specialized disk layouts and index structures (e.g., B+ trees) that are not needed
by analytics workloads. Finally, all three domains share the visualization component, which can
be used for viewing input, intermediate data, and results.

Figure 5.1: Relationships between analytics, high-performance computing, and data management 
models. 

Thus, an ideal analytics system should have the following key characteristics: 

• a heterogeneous scale-out architecture that supports different types of compute, memory, 
and I/O devices, and the ability to flexibly choose resources for a given workload; 

• a focus on information integration, not just computational performance. In addition to 
supporting matrix (sparse and dense) computations, the system should also provide support for 
data structures such as hash tables, high-dimensional trees, linked lists, etc.; 

• tighter integration with the data-centric ecosystem, e.g., data warehouses, text repositories, 
stream processing systems, etc.; 

• balanced support for computation, memory, networking, and I/O; and 

• the ability to support multiple analytics workloads (multi-tenancy). 
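
As a rough illustration of the first characteristic (flexibly choosing resources for a given 
workload), the following minimal sketch matches workloads to nodes in a heterogeneous pool. The 
node types, workload names, and capability numbers are all hypothetical and chosen only for 
illustration; the scoring rule (a weighted sum of capabilities) is an assumption, not a method 
described in this book. 

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A node type in a heterogeneous pool (all values are hypothetical)."""
    name: str
    compute: float  # relative compute throughput, 0..1
    memory: float   # relative memory capacity/bandwidth, 0..1
    io: float       # relative I/O bandwidth, 0..1


@dataclass(frozen=True)
class Workload:
    """A workload described by how strongly it depends on each resource."""
    name: str
    compute: float
    memory: float
    io: float


def score(resource: Resource, workload: Workload) -> float:
    """Weighted sum of node capabilities, weighted by the workload's demands."""
    return (resource.compute * workload.compute
            + resource.memory * workload.memory
            + resource.io * workload.io)


def choose_resource(pool, workload):
    """Return the node type with the highest score for this workload."""
    return max(pool, key=lambda r: score(r, workload))


def assign(pool, workloads):
    """Map each workload name to the name of its best-matching node type."""
    return {w.name: choose_resource(pool, w).name for w in workloads}


# Hypothetical pool and workload mix, for illustration only.
POOL = [
    Resource("gpu_node", compute=0.9, memory=0.3, io=0.2),
    Resource("large_memory_node", compute=0.4, memory=0.9, io=0.3),
    Resource("flash_storage_node", compute=0.3, memory=0.4, io=0.9),
]

WORKLOADS = [
    Workload("dense_matrix_training", compute=0.8, memory=0.1, io=0.1),
    Workload("hash_join", compute=0.2, memory=0.7, io=0.1),
    Workload("table_scan", compute=0.1, memory=0.2, io=0.7),
]

if __name__ == "__main__":
    for workload_name, node_name in assign(POOL, WORKLOADS).items():
        print(f"{workload_name} -> {node_name}")
```

The sketch treats each workload independently. It ignores contention between tenants, 
networking, and data placement, all of which a real multi-tenant scheduler would need to take 
into account. 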


Clearly, a traditional scalable HPC system designed for accelerating FLOP-intensive work- 
loads may not be a good analytics system. In contrast, a well-designed analytics system can serve 
as an HPC system. At present, there are a few scale-out systems that are specialized for individual 
domains, e.g., Netezza for OLAP (Feldman [2013]) and Google Brain (Le et al. [2012]), Baidu 
Minwa (Wu et al. [2015]), and Project Adam (Chilimbi et al. [2014]) for deep learning. However, 
these systems are not flexible enough to provide effective support for analytics workloads. 
The problem of designing a scalable, flexible, multi-tenant analytics system is open and needs 
further investigation. 


Further Reading 
Accelerators: Dharmapurikar et al. [2004], Farroukh et al. [2011], Fitch [2013], Fo- 
APPENDIX A 


Examples of Industrial Sectors and Associated Analytical Solutions 


1. Financial Services 
• Core Banking: Customer Insight, Product Recommendation, Fraud Detection and 
Prevention, Underwriting, KYC, Credit Scoring 
• Payments: Fraud Detection and Prevention, Anti-Money Laundering, Underwriting 
• Financial Markets: Pricing, Risk Analysis, Fraud Detection and Prevention, Portfolio 
Analysis, Product Recommendation, Merger and Acquisition Analytics 
• Insurance: Risk Analysis, Cause and Effect Analysis, Underwriting, Claims Analysis 
• Financial Reporting: Revenue Prediction, Regulatory/Compliance Reporting, 
Scorecard/Performance Management 

2. Healthcare 
• Drug Interactions, Disease Management, Preliminary Diagnostic Analysis, 
BioMedical Statistics 
• Healthcare Payer: Insurance Fraud, Clinical Cause and Effect, Medical Record 
Management, Network Management Analytics 
• Healthcare Provider: Employer Group Analytics, Patient Access Management, 
Clinical Resource Management, Patient Throughput, Quality and Compliance 

3. Retail: Promotions, Inventory Replenishment, Shelf Management, Demand Forecasting, 
Price and Merchandising Optimizations, Real-estate Optimizations, Workforce Efficiency 
Optimizations 

4. Manufacturing: Supply Chain Optimizations, Demand Forecasting, Inventory 
Replenishment, Warranty Analysis, Product Customization, Product Configuration Management 

5. Transportation: Scheduling, Routing, Yield Management, Traffic Congestion Analysis 


6. Hospitality: Pricing, Customer Loyalty Analysis, Yield Management, Social-media 
Marketing, Workforce Scheduling 

7. Energy: Trading, Supply-demand Forecasting, Compliance, Network Optimizations 

8. Communications: Price Plan Optimizations, Customer Retention, Demand Forecasting, 
Capacity Planning, Network Optimizations, Customer Profitability 

9. Integrated Supply Chain: Scheduling, Capacity Planning, Demand-Supply Matching, 
Location Analysis, Routing 

10. Marketing and Sales Analytics: Customer Segmentation, Conjoint Analysis, Lifetime Value 
Analysis, Topic/Trend Analysis, Market Experimentation, Yield (Revenue) Analysis 

11. Legal Analytics: eDiscovery, Identification, Collection, Record Management 

12. Customer Analytics: Customer Retention/Attraction, Pricing Optimizations, Brand 
Management, Customer Life Cycle Management, Customer-specific Content Specialization, 
Call Center Voice Analytics 

13. Life Sciences: Gene Pool Analysis, Drug Discovery, BioInformatics 

14. Human Resources: Churn Analysis, Talent Management, Benefits Analysis, Workforce 
Placement Optimizations, Call Center Staffing 

15. Government: Fraud Detection, Crime Prevention Management, Revenue Optimizations, 
Tax Compliance and Recovery Strategies, Transportation Planning 

16. eCommerce: Web Metrics, Web-page Visitor Analysis, Customizable Site Designs, 
Customer Recommendations, Online Advertisements, Online Trend Analysis, Blog Analysis 